Local Minima and Plateaus in Multilayer Neural Networks

Kenji Fukumizu and Shun-ichi Amari
Brain Science Institute, RIKEN
Hirosawa 2-1, Wako, Saitama, Japan

Abstract

Local minima and plateaus pose a serious problem in learning of neural networks. We investigate the geometric structure of the parameter space of three-layer perceptrons in order to show the existence of local minima and plateaus. It is proved that a critical point of the model with H − 1 hidden units always gives a critical point of the model with H hidden units. Based on this result, we prove that the critical point corresponding to the global minimum of a smaller model can be a local minimum or a saddle point of the larger model. We give a necessary and sufficient condition for this. The results are universal in the sense that they do not use special properties of the target, loss function, or activation function, but only the hierarchical structure of the model.

1 Introduction

It has long been believed that the error surface of multilayer perceptrons (MLP) has in general many local minima. This has been regarded as one of the disadvantages of neural networks, and a great deal of effort has been spent on finding good methods of avoiding them. There have been no rigorous results, however, that prove the existence of local minima. Even in the XOR problem, the existence of local minima had been controversial. Lisboa and Perantonis ([1]) elucidated all the critical points of the XOR problem and asserted, with the help of numerical simulations, that some of them are local minima. Recently, Hamey ([2]) and Sprinkhuizen-Kuyper & Boers ([3]) rigorously proved that what had been believed to be local minima in [1] correspond to local minima with infinite parameter values, and that there are no local minima in the finite weight region for the XOR problem. The existence of local minima in general cases is still an open problem. It is also difficult to derive meaningful results on local minima from numerical experiments.
We often see extremely slow dynamics around a point in simulations, but it is not easy to tell rigorously whether such a point is a local minimum. It is known ([4],[5]) that a typical learning curve shows a plateau in the middle of training, during which the training error almost stops decreasing; a plateau is easily mistaken for a local minimum.

We mathematically investigate critical points of MLP that are caused by the hierarchical structure of the model. We discuss only networks with one output unit in this paper. The function space of networks with H − 1 hidden units is included in the function space of networks with H hidden units. However, the relation between their parameter spaces is not so simple ([6],[7]). We investigate their geometric structure and elucidate how a parameter of a smaller network is embedded in the parameter space of larger networks. We show that a critical point of the error surface for the smaller model gives a set of critical points for the larger model. The main purpose of this paper is to show that a subset of the critical points corresponding to the global minimum of the smaller model can be local minima of the larger model. The set of critical points is divided into two parts: local minima and saddles. We give an explicit condition for when this occurs. This gives a formal proof of the existence of local minima for the first time. Moreover, the coexistence of local minima and saddles explains a serious mechanism of plateaus: when such is the case, the parameters are attracted to the part consisting of local minima, walk randomly along it for a long time, but eventually escape through the part consisting of saddles.
2 Geometric structure of the parameter space

2.1 Basic definitions

We consider a three-layer perceptron with one linear output unit and L input units. The function of a network with H hidden units is defined by

    f^{(H)}(x; θ^{(H)}) = Σ_{j=1}^{H} v_j φ(w_j^T x),    (1)

where x ∈ R^L is an input vector and θ^{(H)} = (v_1, ..., v_H, w_1^T, ..., w_H^T)^T is the parameter vector. We do not use bias terms for simplicity. The function φ(t) is called an activation function. In this paper, we use tanh for φ; however, our results can easily be extended to a wider class of functions with the necessary modifications.

Given N training data {(x^{(ν)}, y^{(ν)})}_{ν=1}^{N}, the objective of training is to find the parameter that minimizes the error function

    E_H(θ) = Σ_{ν=1}^{N} ℓ(y^{(ν)}, f(x^{(ν)}; θ)),    (2)

where ℓ(y, z) is a loss function. If ℓ(y, z) = (1/2)‖y − z‖², the objective function is the mean square error. Another popular choice is the cross-entropy. The results in this paper are independent of the choice of loss function.

2.2 Hierarchical structure of MLP

The parameter θ^{(H)} runs over an (L+1)H-dimensional Euclidean space Θ_H. All the functions of eq.(1) realized by Θ_H constitute a function space

    S_H = { f^{(H)}(·; θ^{(H)}) : R^L → R | θ^{(H)} ∈ Θ_H }.    (3)

We denote the map from Θ_H onto S_H by

    π_H : Θ_H → S_H,  θ^{(H)} ↦ f(·; θ^{(H)}).    (4)

We sometimes write f^{(H)} for π_H(θ). A very important point is that π_H is not one-to-one; that is, different θ^{(H)} may give the same function. It is easy to see that the interchange of (v_{j1}, w_{j1}) and (v_{j2}, w_{j2}) does not alter the image of π_H. For the tanh activation, Chen et al. ([6]) showed that any analytic transform T of Θ_H such that f^{(H)}(x; T(θ)) = f^{(H)}(x; θ) is a composition of such interchanges and the sign flips (v_j, w_j) ↦ (−v_j, −w_j). These transforms constitute an algebraic group G_H.

The function spaces S_H (H = 0, 1, 2, ...) have a trivial hierarchical structure

    S_0 ⊂ S_1 ⊂ ... ⊂ S_{H−1} ⊂ S_H.    (5)

On the other hand, given a function f(·; θ^{(H−1)}) realized by a network with H − 1 hidden units, there is a whole family of parameters θ^{(H)} ∈ Θ_H that realize f(·; θ^{(H−1)}).
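The model of eq.(1), the error of eq.(2), and the non-uniqueness of π_H can be made concrete in a few lines. This is a hypothetical NumPy sketch (the paper itself contains no code; names and shapes are our own choices):

```python
import numpy as np

def f(x, V, W):
    """f^{(H)}(x; theta) = sum_j v_j * tanh(w_j^T x), as in eq. (1).
    x: input of shape (L,), V: output weights (H,), W: hidden weights (H, L)."""
    return V @ np.tanh(W @ x)

def E(V, W, X, Y):
    """Error function of eq. (2) with the squared loss l(y, z) = 0.5*(y - z)^2."""
    return sum(0.5 * (y - f(x, V, W)) ** 2 for x, y in zip(X, Y))
```

Interchanging two hidden units, or flipping the sign of some (v_j, w_j), leaves f unchanged because tanh is odd; these are exactly the transforms that generate the group G_H.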
Mathematically speaking, a map ι from Θ_{H−1} to Θ_H that makes the following diagram commute is not uniquely determined:

    Θ_{H−1}  --ι-->  Θ_H
       | π_{H−1}        | π_H
       v                v
    S_{H−1}  --⊂-->  S_H    (6)

The set of all parameters θ^{(H)} that realize functions of smaller networks is denoted by Ω_H = π_H^{−1}(S_{H−1}). Sussmann ([7]) showed that Ω_H is the union of the following three kinds of submanifolds of Θ_H:

    A_j = { v_j = 0 }  (1 ≤ j ≤ H),
    B_j = { w_j = 0 }  (1 ≤ j ≤ H),
    C_{j1 j2} = { w_{j1} = w_{j2} }  (j1 < j2).

Fig. 1 illustrates these parameters. In A_j and B_j, the j-th hidden unit plays no role in the value of the input-output function. In C_{j1 j2}, the j1-th and j2-th hidden units can be integrated into one, where v_{j1} + v_{j2} is the weight of the new unit to the output unit. From the viewpoint of mathematical statistics, it is also known ([8]) that Ω_H is the set of all points at which the Fisher information matrix is singular.

Next, we will see how a specific function of the smaller model is embedded in the parameter space of the larger model. Let f(·; θ^{(H−1)}) be a function in S_{H−1} − S_{H−2}. To distinguish Θ_{H−1} and Θ_H, we use different parameter variables and indexing:

    f^{(H−1)}(x; θ^{(H−1)}) = Σ_{j=2}^{H} ζ_j φ(u_j^T x).    (7)

Then, given θ^{(H−1)}, the parameter set in Θ_H realizing f(·; θ^{(H−1)}) is the union of the submanifolds contained in each of A_j, B_j and C_{j1 j2}. For
Figure 1: Networks given by A_j, B_j and C_{j1 j2}.

simplicity, we show only one example each of the submanifolds contained in A_1, B_1 and C_{12}:

    Θ_A = { v_1 = 0, v_j = ζ_j, w_j = u_j (j ≥ 2), w_1 : free },
    Θ_B = { w_1 = 0, v_j = ζ_j, w_j = u_j (j ≥ 2), v_1 : free },
    Γ   = { w_1 = w_2 = u_2, v_1 + v_2 = ζ_2, v_j = ζ_j, w_j = u_j (j ≥ 3) }.    (8)

All the other sets of parameters realizing f(·; θ^{(H−1)}) are obtained as transforms of Θ_A, Θ_B and Γ by T ∈ G_H. The submanifold Θ_A is an L-dimensional affine space parallel to the w_1-plane, Θ_B is a line with arbitrary v_1, and Γ is a line defined by v_1 + v_2 = ζ_2 in the v_1-v_2 plane. Thus, each function of a smaller network is realized by high-dimensional submanifolds in Θ_H.

For further analysis, we define canonical embeddings of Θ_{H−1} into Θ_H, which make the diagram (6) commute:

    α_w : θ^{(H−1)} ↦ (0, ζ_2, ..., ζ_H, w, u_2, ..., u_H),
    β_v : θ^{(H−1)} ↦ (v, ζ_2, ..., ζ_H, 0, u_2, ..., u_H),
    γ_λ : θ^{(H−1)} ↦ (λζ_2, (1−λ)ζ_2, ζ_3, ..., ζ_H, u_2, u_2, u_3, ..., u_H),    (9)

where w ∈ R^L, v ∈ R, and λ ∈ R are their parameters. As these parameters vary, the images of the embeddings span the submanifolds Θ_A, Θ_B and Γ respectively; that is,

    Θ_A = { α_w(θ^{(H−1)}) | w ∈ R^L },  Θ_B = { β_v(θ^{(H−1)}) | v ∈ R },  Γ = { γ_λ(θ^{(H−1)}) | λ ∈ R }.    (10)

3 Critical points of MLP

Generally, the optimal parameter cannot be calculated analytically, and some numerical optimization method is needed to obtain an approximation of the global minimum of E_H. One widely-used method is steepest descent, which leads to the learning rule

    θ(t+1) = θ(t) − η ∂E_H/∂θ (θ(t)).

However, this learning rule stops at any critical point, at which ∂E_H/∂θ (θ) = 0, even if it is not the global minimum. There are three types of critical points: a local minimum, a local maximum, and a saddle point.
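The three embeddings of eq.(9) are mechanical to write down in code. The following hypothetical NumPy sketch (our own naming, not the authors') builds a parameter of the H-unit model from a parameter (ζ, U) of the (H−1)-unit model; by construction the realized function is unchanged, i.e. all three land in the fiber π_H^{−1}(f^{(H−1)}):

```python
import numpy as np

def f(x, V, W):
    # f(x; theta) = sum_j v_j * tanh(w_j^T x), as in eq. (1)
    return V @ np.tanh(W @ x)

# The smaller network is given by zeta (shape (H-1,)) and U (shape (H-1, L)),
# indexed j = 2, ..., H as in eq. (7); zeta[0] is zeta_2 and U[0] is u_2.

def alpha(zeta, U, w):
    """alpha_w: the new unit gets output weight v_1 = 0; its input weight w is free."""
    return np.concatenate(([0.0], zeta)), np.vstack([w, U])

def beta(zeta, U, v):
    """beta_v: the new unit gets input weight w_1 = 0; its output weight v is free."""
    return np.concatenate(([v], zeta)), np.vstack([np.zeros(U.shape[1]), U])

def gamma(zeta, U, lam):
    """gamma_lambda: unit 2 is split into two copies of u_2 whose output
    weights lam*zeta_2 and (1-lam)*zeta_2 sum to zeta_2."""
    V = np.concatenate(([lam * zeta[0], (1.0 - lam) * zeta[0]], zeta[1:]))
    return V, np.vstack([U[0], U])
```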
A critical point θ_0 is called a local minimum (maximum) if there exists a neighborhood of θ_0 such that E_H(θ) ≥ E_H(θ_0) (resp. E_H(θ) ≤ E_H(θ_0)) holds for any θ in the neighborhood, and it is called a saddle if it is neither a local minimum nor a local maximum, that is, if every neighborhood of θ_0 contains a point at which E_H is smaller than E_H(θ_0) and a point at which E_H is larger than E_H(θ_0). It is well known that if the Hessian matrix at a critical point is positive (negative) definite, the critical point is a local minimum (maximum), and that if the Hessian has both positive and negative eigenvalues, it is a saddle.

We look for critical points of E_H in Ω_H. Let θ^{(H−1)} = (ζ_2, ..., ζ_H, u_2, ..., u_H) ∈ Θ_{H−1} − Ω_{H−1} be a critical point of E_{H−1}. Such a point really exists if we assume that the global minimum of E_{H−1} is not included in Ω_{H−1}. Then, we have

    Σ_{ν=1}^{N} ∂ℓ/∂z^{(ν)} φ(u_j^T x^{(ν)}) = 0,
    ζ_j Σ_{ν=1}^{N} ∂ℓ/∂z^{(ν)} φ'(u_j^T x^{(ν)}) x^{(ν)} = 0,    (11)
for 2 ≤ j ≤ H, where we write

    ∂ℓ/∂z^{(ν)} = ∂ℓ/∂z (y^{(ν)}, f^{(H−1)}(x^{(ν)}; θ^{(H−1)}))    (12)

for simplicity. We have two kinds of critical points, as follows.

Theorem 1. Let β_v and γ_λ be as in eq.(9). Then γ_λ(θ^{(H−1)}) for all λ, and β_0(θ^{(H−1)}), are critical points of E_H.

The proof is easy. Noting that f^{(H−1)}(x; θ^{(H−1)}) = f^{(H)}(x; θ) for θ = γ_λ(θ^{(H−1)}) or β_0(θ^{(H−1)}), the condition that θ be a critical point of E_H reduces to eq.(11). Because α_0 = β_0, the embedding α_w gives the same critical point at w = 0. The critical points γ_λ(θ^{(H−1)}) form a line in Θ_H as λ moves over R. If θ is a critical point of E_H, so is T(θ) for all T ∈ G_H; hence we have many critical lines in Θ_H.

4 Local minima of MLP

4.1 A condition for local minima

In this section, we focus on the critical points γ_λ(θ^{(H−1)}) and give a condition under which such a point is a local minimum or a saddle. The usual sufficient condition using the Hessian matrix cannot be applied in this case: the Hessian is singular, because the whole line through the point shares a common value of E_H. Let θ^{(H−1)} be a point in Θ_{H−1}. We define the following L × L symmetric matrix:

    A_2 = ζ_2 Σ_{ν=1}^{N} ∂ℓ/∂z^{(ν)} φ''(u_2^T x^{(ν)}) x^{(ν)} x^{(ν)T}.    (13)

Theorem 2. Let θ^{(H−1)} be a local minimum of E_{H−1} at which the Hessian matrix of E_{H−1} is positive definite, let γ_λ be as in eq.(9), and let Γ = { θ ∈ Θ_H | θ = γ_λ(θ^{(H−1)}), λ ∈ R }. If A_2 is positive (negative) definite, any point in the set Γ_0 = { θ ∈ Γ | λ(1−λ) > 0 (resp. < 0) } is a local minimum of E_H, and any point in Γ − Γ_0 is a saddle. If A_2 has both positive and negative eigenvalues, all the points of Γ are saddle points.

Figure 2: Error surface around local minima.

For the proof, see the Appendix. The local minima given by Theorem 2, if any, appear as line segments. Such a local minimum can be changed into a saddle by moving along the line, without altering the function f^{(H)}. Fig. 2 illustrates the error surface in this case.

4.2 Plateaus

We consider the case where A_2 is positive (negative) definite. If we map Γ to the function space, π_H(Γ) consists of the single function f^{(H)} ∈ S_{H−1} ⊂ S_H.
Therefore, if we regard the cost function E_H as a function on S_H, π_H(Γ) is a saddle, because E_H takes values both larger and smaller than E_H(f^{(H)}) in any neighborhood of f^{(H)} in S_H. It is interesting to see that Γ_0 is attractive in its neighborhood: any point in a small neighborhood of Γ_0 is attracted to it. However, Γ_0 is neutrally stable in the direction along Γ_0, so a point attracted to Γ_0 fluctuates randomly along Γ_0 and eventually escapes from Γ when it reaches Γ − Γ_0. This takes a long time because of the nature of a random walk, which explains why critical points of this type cause serious plateaus. This is a new type of saddle, which has so far not been remarked in nonlinear dynamics. This type of "intrinsic saddle" is given rise to by the singular structure of the topology of S_H.

4.3 Numerical simulation

We have run a numerical simulation to exemplify the local minima given by Theorem 2, using an MLP with 1 input, 1 output, and 2 hidden units. We use the logistic function
φ(t) = 1/(1 + e^{−t}) for the activation function, and ℓ(y, z) = (1/2)‖y − z‖² for the loss function. For training data, 100 random input data are generated, and output data are obtained as y = f(x) + Z, where f(x) = 2φ(x) − φ(4x) and Z is Gaussian noise with variance 10^{−4}. Using back-propagation, we train the parameter of the MLP with 1 hidden unit, and use it as the global minimum θ^{(1)}. In this case, we have (ζ_2, u_2) = (0.98, 0.47) and A_2 = 1.9 > 0. Then, any point in Γ_0 = { γ_λ(θ^{(1)}) | 0 < λ < 1 } is a local minimum. We set v_1 = v_2 = ζ_2/2 as θ* (i.e. λ = 1/2), and evaluate E_2(θ) at one million random points around θ*, generated by a normal distribution with variance 10^{−6}. As a result, all these values are larger than E_2(θ*). This experimentally verifies that θ* is a local minimum. The graphs of the target function and the function f(x; θ*) given by the local minimum are shown in Fig. 3.

Figure 3: A local minimum in MLP (target function f(x) and the local-minimum function).

5 Conclusion

We investigated the geometric structure of the parameter space of multilayer perceptrons with H − 1 hidden units embedded in the parameter space of H hidden units. Based on this structure, we found a finite family of critical point sets of the error surface: a critical point of a smaller network is embedded into the parameter space of the larger network as a set of critical points. We further elucidated a condition under which a point in the image of one embedding is a local minimum. We saw that under one condition there exist local minima forming line segments in the parameter space, which cause serious plateaus, because all points around the set of local minima first converge to it and then have to escape from it by random fluctuation. It is important to see whether the critical sets of Theorem 2 are the only cause of plateaus. If this is the case, we can avoid them by the method of natural gradient ([5],[9]). However, this is still left as an open problem.

References

[1] P.J.G. Lisboa & S.J. Perantonis. Complete solution of the local minima in the XOR problem.
Network, 2:119-124, 1991.
[2] L.G.C. Hamey. XOR has no local minima: a case study in neural network error surface analysis. Neural Networks, 11(4):669-682, 1998.
[3] I.G. Sprinkhuizen-Kuyper & E.J.W. Boers. The error surface of the 2-2-1 XOR network: the finite stationary points. Neural Networks, 11(4):683-690, 1998.
[4] D. Saad & S. Solla. On-line learning in soft committee machines. Physical Review E, 52:4225-4243, 1995.
[5] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251-276, 1998.
[6] A.M. Chen, H. Lu, & R. Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Computation, 5:910-927, 1993.
[7] H.J. Sussmann. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, 5:589-593, 1992.
[8] K. Fukumizu. A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5):871-879, 1996.
[9] S. Amari, H. Park, & K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, to appear.
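The simulation of Sec. 4.3 can be re-created along the following lines. This is a hypothetical re-implementation, not the authors' code: the input distribution, optimizer and step size are not specified in the paper, so standard-normal inputs and plain batch gradient descent are assumed here, and fewer perturbation samples are drawn:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))           # logistic activation

# training data: y = f(x) + Z with f(x) = 2*sigma(x) - sigma(4x), Var(Z) = 1e-4
N = 100
X = rng.normal(size=N)                               # assumed input distribution
Y = 2 * sigma(X) - sigma(4 * X) + 1e-2 * rng.normal(size=N)

def E1(zeta, u):
    # error of the 1-hidden-unit MLP, eq. (2) with squared loss
    return 0.5 * np.sum((Y - zeta * sigma(u * X)) ** 2)

# train the 1-hidden-unit network by batch gradient descent
zeta, u, lr = 1.0, 1.0, 0.01
for _ in range(20000):
    s = sigma(u * X)
    r = zeta * s - Y                                 # dl/dz at each sample
    g_zeta = np.sum(r * s)
    g_u = np.sum(r * zeta * s * (1 - s) * X)
    zeta -= lr * g_zeta
    u -= lr * g_u

# A_2 of eq. (13) with phi = logistic, phi'' = s(1-s)(1-2s)
s = sigma(u * X)
A2 = zeta * np.sum((zeta * s - Y) * s * (1 - s) * (1 - 2 * s) * X ** 2)

# embed at lambda = 1/2: theta* = gamma_{1/2}(zeta, u)
def E2(v1, v2, w1, w2):
    return 0.5 * np.sum((Y - v1 * sigma(w1 * X) - v2 * sigma(w2 * X)) ** 2)

theta_star = np.array([zeta / 2, zeta / 2, u, u])
E_star = E2(*theta_star)
# count perturbed points that do not decrease the error (std 1e-3, variance 1e-6)
worse = sum(E2(*(theta_star + 1e-3 * rng.normal(size=4))) >= E_star
            for _ in range(10000))
print(A2, worse)   # worse equals 10000 if theta* is a local minimum and A2 > 0
```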
Appendix A: Proof of Theorem 2

We prove the theorem on Γ itself; the corresponding statement for its transforms T(Γ) (T ∈ G_H), such as the image under the sign flip (v_2, w_2) ↦ (−v_2, −w_2), follows because every T ∈ G_H preserves E_H and hence the local properties of the error surface. For simplicity, we reorder the components of θ^{(H)} and θ^{(H−1)} as (v_1, v_2, w_1^T, w_2^T, v_3, ..., v_H, w_3^T, ..., w_H^T) and (ζ_2, u_2^T, ζ_3, ..., ζ_H, u_3^T, ..., u_H^T), respectively.

We introduce a new coordinate system of Θ_H. Let (ξ^T, μ, ζ_2, b^T, v_3, ..., v_H, w_3^T, ..., w_H^T) be a coordinate system of Θ_H − { v_1 + v_2 = 0 }, where

    μ = (v_1 − v_2)/(v_1 + v_2),  ξ = w_1 − w_2,  ζ_2 = v_1 + v_2,
    b = (v_1 w_1 + v_2 w_2)/(v_1 + v_2).    (14)

This is well-defined as a coordinate system, since the inverse is given by

    v_1 = ((1+μ)/2) ζ_2,  w_1 = b + ((1−μ)/2) ξ,
    v_2 = ((1−μ)/2) ζ_2,  w_2 = b − ((1+μ)/2) ξ.    (15)

Using this coordinate system, the embedding γ_λ is expressed as

    γ_λ : (ζ_2, u_2, ζ_3, ..., ζ_H, u_3, ..., u_H) ↦ (0, 2λ−1, ζ_2, u_2, ζ_3, ..., ζ_H, u_3, ..., u_H).    (16)

The critical point set Γ is a line parallel to the μ-axis with ξ = 0, b = u_2, v_j = ζ_j and w_j = u_j (3 ≤ j ≤ H). Let μ̂ be the μ-component of a point θ̂ ∈ Γ, and let V_μ̂ = { θ ∈ Θ_H | μ = μ̂ } be a complementary space; we have Γ ∩ V_μ̂ = {θ̂}. If θ̂ is a local minimum within V_μ̂ for any θ̂ ∈ Γ_0, it is a local minimum also in Θ_H, because E_H is constant along Γ; and if θ̂ is a saddle within V_μ̂, it is a saddle also in Θ_H. Thus, we can reduce the problem to the Hessian of E_H restricted to V_μ̂, which we write G.

From eq.(15), we have

Lemma 1. For any θ ∈ { ξ = 0 },

    ∂f/∂ξ (x; θ) = 0,  ∂²f/∂ξ∂ρ (x; θ) = 0,    (17)

where ρ stands for any of the coordinates ζ_2, b, v_j, w_j (3 ≤ j ≤ H).

Since ∂f/∂ξ = 0 on { ξ = 0 }, the first-derivative terms of f drop from the Hessian of E_H, and we obtain

    ∂²E_H/∂ξ∂ξ^T (θ) = Σ_{ν=1}^{N} ∂ℓ/∂z^{(ν)} ∂²f/∂ξ∂ξ^T (x^{(ν)}; θ).    (18)

From Lemma 1, the mixed second derivatives between ξ and the remaining coordinates of V_μ̂ vanish at any θ̂ ∈ Γ. Then, from eq.(16), the restricted Hessian is block diagonal:

    G = ( ∇∇E_{H−1}(θ^{(H−1)})   O ;  O   ∂²E_H/∂ξ∂ξ^T (θ̂) ).    (19)

By simple calculation, we can prove

Lemma 2. For any θ ∈ { ξ = 0 },

    ∂²f/∂ξ∂ξ^T (x; θ) = (v_1 v_2 / ζ_2) φ''(b^T x) x x^T.    (20)

From eq.(18) and Lemma 2, at θ̂ = γ_λ(θ^{(H−1)}) we have v_1 v_2 / ζ_2 = λ(1−λ)ζ_2 and b = u_2, so that

    ∂²E_H/∂ξ∂ξ^T (θ̂) = λ(1−λ) A_2.    (21)

Noting that ∇∇E_{H−1}(θ^{(H−1)}) is positive definite and ζ_2 ≠ 0, if λ(1−λ)A_2 is positive definite, so is G; and if λ(1−λ)A_2 has negative eigenvalues, G has both positive and negative eigenvalues. This completes the proof.
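Two quantitative facts behind Theorem 2 can be checked numerically: E_H is constant along the critical line Γ, and the second derivative of E_H in the transversal direction ξ = w_1 − w_2 equals λ(1−λ)A_2, as computed in the Appendix. The sketch below is a hypothetical check for L = 1, H = 2 and the squared loss; the data and (ζ_2, u_2) are chosen arbitrarily, since both identities hold at any embedded point:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=20)                      # N = 20 scalar inputs (L = 1)
Y = rng.normal(size=20)                      # arbitrary targets
zeta2, u2 = 0.8, 1.3                         # an arbitrary theta^{(1)}
lam = 0.3

def E2(v1, v2, w1, w2):
    # E_H for H = 2 with the squared loss, eq. (2)
    out = v1 * np.tanh(w1 * X) + v2 * np.tanh(w2 * X)
    return 0.5 * np.sum((Y - out) ** 2)

def E_along_gamma(lam):
    # theta = gamma_lambda(theta^{(1)}): v = (lam*zeta2, (1-lam)*zeta2), w = (u2, u2)
    return E2(lam * zeta2, (1 - lam) * zeta2, u2, u2)

def E_xi(xi, lam):
    # transversal move with b, mu fixed: w1 = u2 + (1-lam)*xi, w2 = u2 - lam*xi
    return E2(lam * zeta2, (1 - lam) * zeta2, u2 + (1 - lam) * xi, u2 - lam * xi)

# A_2 of eq. (13): zeta_2 * sum_nu dl/dz * phi''(u2 x) * x^2, with dl/dz = f - y
t = np.tanh(u2 * X)
resid = zeta2 * t - Y
phi2 = -2.0 * t * (1.0 - t ** 2)             # second derivative of tanh
A2 = zeta2 * np.sum(resid * phi2 * X ** 2)

# central finite difference of E_H along the xi direction
h = 1e-4
curv = (E_xi(h, lam) - 2 * E_xi(0.0, lam) + E_xi(-h, lam)) / h ** 2
```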
Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper
More informationIntroduction to gradient descent
6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our
More informationNeural networks. Chapter 20. Chapter 20 1
Neural networks Chapter 20 Chapter 20 1 Outline Brains Neural networks Perceptrons Multilayer networks Applications of neural networks Chapter 20 2 Brains 10 11 neurons of > 20 types, 10 14 synapses, 1ms
More information100 inference steps doesn't seem like enough. Many neuron-like threshold switching units. Many weighted interconnections among units
Connectionist Models Consider humans: Neuron switching time ~ :001 second Number of neurons ~ 10 10 Connections per neuron ~ 10 4 5 Scene recognition time ~ :1 second 100 inference steps doesn't seem like
More informationLearning from Data: Multi-layer Perceptrons
Learning from Data: Multi-layer Perceptrons Amos Storkey, School of Informatics University of Edinburgh Semester, 24 LfD 24 Layered Neural Networks Background Single Neurons Relationship to logistic regression.
More informationLecture 6. Regression
Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron
More informationNeural Networks. Volker Tresp Summer 2015
Neural Networks Volker Tresp Summer 2015 1 Introduction The performance of a classifier or a regression model critically depends on the choice of appropriate basis functions The problem with generic basis
More informationMachine Learning: Multi Layer Perceptrons
Machine Learning: Multi Layer Perceptrons Prof. Dr. Martin Riedmiller Albert-Ludwigs-University Freiburg AG Maschinelles Lernen Machine Learning: Multi Layer Perceptrons p.1/61 Outline multi layer perceptrons
More informationRadial Basis Function (RBF) Networks
CSE 5526: Introduction to Neural Networks Radial Basis Function (RBF) Networks 1 Function approximation We have been using MLPs as pattern classifiers But in general, they are function approximators Depending
More informationDeep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville Chapter 6 :Deep Feedforward Networks Benoit Massé Dionyssos Kounades-Bastian Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationOptmization Methods for Machine Learning Beyond Perceptron Feed Forward neural networks (FFN)
Optmization Methods for Machine Learning Beyond Perceptron Feed Forward neural networks (FFN) Laura Palagi http://www.dis.uniroma1.it/ palagi Dipartimento di Ingegneria informatica automatica e gestionale
More informationNovel determination of dierential-equation solutions: universal approximation method
Journal of Computational and Applied Mathematics 146 (2002) 443 457 www.elsevier.com/locate/cam Novel determination of dierential-equation solutions: universal approximation method Thananchai Leephakpreeda
More informationLearning with Ensembles: How. over-tting can be useful. Anders Krogh Copenhagen, Denmark. Abstract
Published in: Advances in Neural Information Processing Systems 8, D S Touretzky, M C Mozer, and M E Hasselmo (eds.), MIT Press, Cambridge, MA, pages 190-196, 1996. Learning with Ensembles: How over-tting
More informationMULTICHANNEL BLIND SEPARATION AND. Scott C. Douglas 1, Andrzej Cichocki 2, and Shun-ichi Amari 2
MULTICHANNEL BLIND SEPARATION AND DECONVOLUTION OF SOURCES WITH ARBITRARY DISTRIBUTIONS Scott C. Douglas 1, Andrzej Cichoci, and Shun-ichi Amari 1 Department of Electrical Engineering, University of Utah
More informationDEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY
DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo
More informationNeural Networks, Computation Graphs. CMSC 470 Marine Carpuat
Neural Networks, Computation Graphs CMSC 470 Marine Carpuat Binary Classification with a Multi-layer Perceptron φ A = 1 φ site = 1 φ located = 1 φ Maizuru = 1 φ, = 2 φ in = 1 φ Kyoto = 1 φ priest = 0 φ
More informationManifold Regularization
9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,
More informationESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.
On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained
More informationMultilayer Neural Networks
Multilayer Neural Networks Introduction Goal: Classify objects by learning nonlinearity There are many problems for which linear discriminants are insufficient for minimum error In previous methods, the
More informationArtificial Intelligence
Artificial Intelligence Jeff Clune Assistant Professor Evolving Artificial Intelligence Laboratory Announcements Be making progress on your projects! Three Types of Learning Unsupervised Supervised Reinforcement
More informationDistinguishing Causes from Effects using Nonlinear Acyclic Causal Models
Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models Kun Zhang Dept of Computer Science and HIIT University of Helsinki 14 Helsinki, Finland kun.zhang@cs.helsinki.fi Aapo Hyvärinen
More informationNeural Networks Lecture 4: Radial Bases Function Networks
Neural Networks Lecture 4: Radial Bases Function Networks H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi
More informationBatch-mode, on-line, cyclic, and almost cyclic learning 1 1 Introduction In most neural-network applications, learning plays an essential role. Throug
A theoretical comparison of batch-mode, on-line, cyclic, and almost cyclic learning Tom Heskes and Wim Wiegerinck RWC 1 Novel Functions SNN 2 Laboratory, Department of Medical hysics and Biophysics, University
More informationNeural Networks. Advanced data-mining. Yongdai Kim. Department of Statistics, Seoul National University, South Korea
Neural Networks Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea What is Neural Networks? One of supervised learning method using one or more hidden layer.
More informationNew concepts: Span of a vector set, matrix column space (range) Linearly dependent set of vectors Matrix null space
Lesson 6: Linear independence, matrix column space and null space New concepts: Span of a vector set, matrix column space (range) Linearly dependent set of vectors Matrix null space Two linear systems:
More informationCS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes
CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders
More informationData Mining (Mineria de Dades)
Data Mining (Mineria de Dades) Lluís A. Belanche belanche@lsi.upc.edu Soft Computing Research Group Dept. de Llenguatges i Sistemes Informàtics (Software department) Universitat Politècnica de Catalunya
More information[Read Ch. 5] [Recommended exercises: 5.2, 5.3, 5.4]
Evaluating Hypotheses [Read Ch. 5] [Recommended exercises: 5.2, 5.3, 5.4] Sample error, true error Condence intervals for observed hypothesis error Estimators Binomial distribution, Normal distribution,
More informationSlide a window along the input arc sequence S. Least-squares estimate. σ 2. σ Estimate 1. Statistically test the difference between θ 1 and θ 2
Corner Detection 2D Image Features Corners are important two dimensional features. Two dimensional image features are interesting local structures. They include junctions of dierent types Slide 3 They
More informationEmpirical Bayes for Learning to Learn. learn" described in, among others, Baxter (1997) and. Thrun and Pratt (1997).
Empirical Bayes for Learning to Learn Tom Heskes SNN, University of Nijmegen, Geert Grooteplein 21, Nijmegen, 6525 EZ, The Netherlands tom@mbfys.kun.nl Abstract We present a new model for studying multitask
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationAdvanced statistical methods for data analysis Lecture 2
Advanced statistical methods for data analysis Lecture 2 RHUL Physics www.pp.rhul.ac.uk/~cowan Universität Mainz Klausurtagung des GK Eichtheorien exp. Tests... Bullay/Mosel 15 17 September, 2008 1 Outline
More informationArtificial Neural Networks
Artificial Neural Networks Stephan Dreiseitl University of Applied Sciences Upper Austria at Hagenberg Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support Knowledge
More informationUnit III. A Survey of Neural Network Model
Unit III A Survey of Neural Network Model 1 Single Layer Perceptron Perceptron the first adaptive network architecture was invented by Frank Rosenblatt in 1957. It can be used for the classification of
More informationCS:4420 Artificial Intelligence
CS:4420 Artificial Intelligence Spring 2018 Neural Networks Cesare Tinelli The University of Iowa Copyright 2004 18, Cesare Tinelli and Stuart Russell a a These notes were originally developed by Stuart
More informationNew Insights and Perspectives on the Natural Gradient Method
1 / 18 New Insights and Perspectives on the Natural Gradient Method Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology March 13, 2018 Motivation 2 / 18
More informationFACTOR MAPS BETWEEN TILING DYNAMICAL SYSTEMS KARL PETERSEN. systems which cannot be achieved by working within a nite window. By. 1.
FACTOR MAPS BETWEEN TILING DYNAMICAL SYSTEMS KARL PETERSEN Abstract. We show that there is no Curtis-Hedlund-Lyndon Theorem for factor maps between tiling dynamical systems: there are codes between such
More informationARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD
ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided
More informationLECTURE # - NEURAL COMPUTATION, Feb 04, Linear Regression. x 1 θ 1 output... θ M x M. Assumes a functional form
LECTURE # - EURAL COPUTATIO, Feb 4, 4 Linear Regression Assumes a functional form f (, θ) = θ θ θ K θ (Eq) where = (,, ) are the attributes and θ = (θ, θ, θ ) are the function parameters Eample: f (, θ)
More informationGradient Descent. Sargur Srihari
Gradient Descent Sargur srihari@cedar.buffalo.edu 1 Topics Simple Gradient Descent/Ascent Difficulties with Simple Gradient Descent Line Search Brent s Method Conjugate Gradient Descent Weight vectors
More informationTheory IIIb: Generalization in Deep Networks
CBMM Memo No. 90 June 29, 2018 Theory IIIb: Generalization in Deep Networks Tomaso Poggio 1, Qianli Liao 1, Brando Miranda 1, Andrzej Banburski 1, Xavier Boix 1 and Jack Hidary 2 1 Center for Brains, Minds,
More information