Conjugate gradient algorithm for training neural networks

1. Introduction

Recall that in the steepest-descent neural network training algorithm, consecutive line-search directions are orthogonal, such that,

g[w(t+1)]^T d(t) = 0,  (1)

where g[w(t+1)] denotes ∇E[w(t+1)], the gradient of the error E with respect to the weights w(t+1) at step (t+1) of the training algorithm, and d(t) denotes the line-search direction at step t of the training algorithm. In steepest descent, of course,

d(t) = -g(t),  (2)

and,

w(t+1) = w(t) - η(t) g(t),  (3)

where η(t) is determined at each step through line minimization.

2. Deriving the conjugate gradient algorithm

A. Introduction

As we have seen, choosing consecutive line-search directions as in equation (2) can lead to oscillatory behavior in the training algorithm, even for simple quadratic error surfaces. This can significantly slow down convergence of the algorithm. To combat this, we should ideally choose consecutive line-search directions that are non-interfering; in other words, consecutive search directions that do not undo the progress of previous search directions. Suppose we have executed a line search along direction d(t) at time step t; then the next search direction d(t+1) should be chosen so that,

g[w(t+1) + η d(t+1)]^T d(t) = 0,  ∀η.  (4)

Condition (4) explicitly preserves the progress made in the previous line search (see the corresponding figure in [1]), and will guide us through our development of the conjugate gradient training algorithm. Below, we closely follow the derivation in [1], but provide somewhat greater detail in certain parts of the derivation.

B. First step

Lemma: Condition (4) implies, to first order,

d(t+1)^T H d(t) = 0,  (5)

where H is the Hessian matrix with elements H(i,j),

H(i,j) = ∂²E / (∂ω_i ∂ω_j),  (6)

evaluated at w(t+1), where,

w = [ω_1, ω_2, ..., ω_W]^T.  (7)

Proof: First, let us approximate g(w)^T by its first-order Taylor approximation about w(t+1),

g(w)^T ≈ g[w(t+1)]^T + [w - w(t+1)]^T ∇{g[w(t+1)]},  (8)

and evaluate the right-hand side of equation (8) at w = w(t+1) + η d(t+1):

g[w(t+1) + η d(t+1)]^T ≈ g[w(t+1)]^T + η d(t+1)^T H.  (9)

Note that in equations (8) and (9),

H = ∇{g[w(t+1)]} = ∇∇E[w(t+1)].  (10)

Substituting equation (9) into (4),

{g[w(t+1)]^T + η d(t+1)^T H} d(t) = 0.  (11)

From (11),

g[w(t+1)]^T d(t) + η d(t+1)^T H d(t) = 0.  (12)

From (1),

g[w(t+1)]^T d(t) = 0,  (13)

so that equation (12) reduces to,

η d(t+1)^T H d(t) = 0.  (14)

Equation (14) implies either η = 0 and/or d(t+1)^T H d(t) = 0. We can dismiss the first of these implications (η = 0), since a zero learning parameter would mean that we are not changing the weights. Therefore, (5) follows from (4) and the proof is complete.

Note that for quadratic error surfaces, the first-order Taylor approximation in (8) is exact. For non-quadratic error surfaces, it is approximately correct in the locally quadratic neighborhood of the weight vector w(t+1). Search directions that meet condition (5) are called H-orthogonal, or conjugate with respect to H. Our overall goal in developing the conjugate gradient algorithm is to construct consecutive search directions d(t) that meet condition (5) without explicit computation of the large Hessian matrix H. Note that an algorithm that uses H-orthogonal search directions will converge to the minimum of the error surface (for quadratic or locally quadratic error surfaces) in at most W steps, where W is the number of weights in the neural network.

C. Linear algebra preliminaries

In our derivation of the conjugate gradient algorithm, we will assume a locally quadratic error surface of the following form:

E(w) ≈ E_0 + b^T w + (1/2) w^T H w,  (15)

where E_0 is a constant scalar, b is a constant vector, and H is a constant matrix (i.e. the Hessian). Furthermore, we assume that H is symmetric and positive-definite, a property that we will sometimes denote by H > 0.

Definition: A square matrix H is positive-definite if and only if all its eigenvalues λ_i are greater than zero. If a matrix is positive-definite, then,

v^T H v > 0,  ∀ v ≠ 0.  (16)

Note that for a quadratic error surface, H is constant; if it is positive-definite at one point, it is therefore positive-definite everywhere. Consider, for example, the following simple quadratic error surface of two weights,

E = 2ω_1² + ω_2².  (17)

For (17), we have previously computed the Hessian to be,

H = | 4  0 |
    | 0  2 |  (18)

with eigenvalues λ_1 = 4 and λ_2 = 2. Hence, H is positive-definite.
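As a quick numerical companion to these preliminaries, the following minimal NumPy sketch (not part of the original notes; the Hessian values follow the reconstructed example (18), and the chosen directions and points are illustrative) checks positive-definiteness, shows that H-orthogonal directions need not be ordinarily orthogonal, and verifies the lemma: moving along a direction that is H-conjugate to d_1 does not disturb the zero gradient component along d_1 on a quadratic surface.

import numpy as np

H = np.array([[4.0, 0.0], [0.0, 2.0]])    # example Hessian (18), as reconstructed
b = np.array([0.0, 0.0])                  # quadratic surface with b = 0 for simplicity
grad = lambda w: b + H @ w

# Positive-definiteness: all eigenvalues > 0, equivalently v^T H v > 0 for v != 0.
print(np.linalg.eigvalsh(H))              # [2. 4.]
rng = np.random.default_rng(0)
for _ in range(3):
    v = rng.standard_normal(2)
    assert v @ H @ v > 0.0

# H-orthogonal (conjugate) directions need not be orthogonal in the ordinary sense.
d1 = np.array([1.0, 1.0])
d2 = np.array([1.0, -2.0])                # chosen so that d2^T H d1 = 4 - 4 = 0
print(d1 @ H @ d2, d1 @ d2)               # 0.0 (conjugate) but -1.0 (not orthogonal)

# Lemma check: from a point where g is orthogonal to d1 (as after a line
# minimization along d1), moving along the conjugate direction d2 keeps it so.
w = np.array([-0.5, 1.0])                 # here g(w)^T d1 = 0
for eta in [0.0, 0.5, 1.0, 2.0]:
    print(grad(w + eta * d2) @ d1)        # all ~0: condition (4) holds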

Note also that, for any error function with continuous partial derivatives, the Hessian is guaranteed to be symmetric, since,

∂²E / (∂ω_i ∂ω_j) = ∂²E / (∂ω_j ∂ω_i).  (19)

Theorem: For a positive-definite square matrix H, H-orthogonal vectors {d_1, d_2, ..., d_k} are linearly independent.

Proof: For linear independence, we must show that,

α_1 d_1 + α_2 d_2 + ... + α_k d_k = 0  (20)

if and only if α_i = 0, ∀i. In equation (20), 0 denotes a vector with the same dimensionality as the d_i and all zero elements. Let us first pre-multiply equation (20) by d_i^T H,

α_1 d_i^T H d_1 + α_2 d_i^T H d_2 + ... + α_k d_i^T H d_k = d_i^T H 0,  (21)

α_1 d_i^T H d_1 + α_2 d_i^T H d_2 + ... + α_k d_i^T H d_k = 0.  (22)

Note that since we have assumed that the vectors d_i, i ∈ {1, 2, ..., k}, are H-orthogonal, i.e.,

d_i^T H d_j = 0,  i ≠ j,  (23)

all terms in (22) drop out except the α_i d_i^T H d_i term, so that (22) reduces to,

α_i d_i^T H d_i = 0.  (24)

Since H is positive-definite, however, we know from (16) that,

d_i^T H d_i > 0,  (25)

so that in order for (24) to be true,

α_i = 0,  ∀ i ∈ {1, 2, ..., k},  (26)

and the proof is complete.

From this theorem, it follows directly that in a W-dimensional space, W H-orthogonal vectors d_i, i ∈ {1, 2, ..., W}, form a complete, albeit non-orthogonal, basis set. This means that any vector v in that space may be expressed as a linear combination of the d_i vectors, i.e.,

v = Σ_{i=1}^{W} α_i d_i.  (27)
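This linear-independence result is easy to confirm numerically. The sketch below (illustrative, not from the notes; the random matrix H and the use of the standard basis are my own choices) builds W mutually H-orthogonal vectors by a Gram-Schmidt-like procedure carried out in the H inner product, then checks that they are linearly independent.

import numpy as np

rng = np.random.default_rng(1)
W = 4
A = rng.standard_normal((W, W))
H = A @ A.T + W * np.eye(W)               # a random symmetric positive-definite "Hessian"

# Conjugate (H-orthogonal) Gram-Schmidt applied to the standard basis vectors.
ds = []
for v in np.eye(W):
    d = v.copy()
    for dk in ds:
        d -= (d @ H @ dk) / (dk @ H @ dk) * dk
    ds.append(d)
D = np.array(ds)                          # rows are d_1, ..., d_W

# Mutually H-orthogonal: all off-diagonal entries of D H D^T are ~0 ...
G = D @ H @ D.T
print(np.max(np.abs(G - np.diag(np.diag(G)))))

# ... and therefore linearly independent: the matrix of directions has full rank.
print(np.linalg.matrix_rank(D))           # W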

D. The setup

This section establishes the foundations of the conjugate gradient algorithm. In our derivation here and in subsequent sections, we will assume a locally quadratic error surface of form (15),

E(w) = E_0 + b^T w + (1/2) w^T H w.  (28)

For now, let us also assume a set of H-orthogonal (or conjugate) vectors d_i, i ∈ {1, 2, ..., W}. Given these conjugate vectors, let us see how we can move from some initial weight vector w_1 to the minimum w* of error surface (28). From (27) we can express the difference between w* and w_1 as,

w* - w_1 = Σ_{i=1}^{W} α_i d_i,  (29)

so that,

w* = w_1 + Σ_{i=1}^{W} α_i d_i.  (30)

Then, if we define,

w_j = w_1 + Σ_{i=1}^{j-1} α_i d_i,  (31)

we can rewrite equation (30) recursively as,

w_{j+1} = w_j + α_j d_j.  (32)

Note that equation (32) represents a sequence of steps in weight space from w_1 to w*. These steps are parallel to the conjugate directions d_j, and the length of each step is controlled by α_j. Equation (32) forms the basis of the conjugate gradient algorithm for training neural networks, and is similar to previous training algorithms that we have studied in that it offers a recursive minimization procedure for the weights w. The big difference from previous algorithms is, of course, that d_j ≠ -g_j, as was the case for the gradient and steepest descent algorithms. Also, the step size α_j is not held constant, as is typical in gradient descent.

At this point, several big questions remain unanswered, however. First, how do we construct the d_j vectors given an error surface? Second, how do we determine the corresponding step sizes α_j such that we are guaranteed convergence in at most W steps on a locally quadratic error surface? Finally, how can we do both without explicit computation of the Hessian H? Below, we answer these questions one by one.

E. Computing correct step sizes

Let us first attempt to find an expression for α_j, assuming that we have a set of W conjugate vectors d_i, i ∈ {1, 2, ..., W}. To do this, let us pre-multiply equation (29) by d_j^T H:

d_j^T H (w* - w_1) = d_j^T H Σ_{i=1}^{W} α_i d_i,  (33)

d_j^T H (w* - w_1) = Σ_{i=1}^{W} α_i d_j^T H d_i.  (34)

By H-orthogonality, all the terms in the right-hand sum of equation (34) where i ≠ j are zero. Thus, equation (34) reduces to,

d_j^T H (w* - w_1) = α_j d_j^T H d_j.  (35)

Now, note that since the gradient of (28) is given by,

g(w) = b + H w,  (36)

the minimum w* can be expressed implicitly as,

g(w*) = 0,  (37)

b + H w* = 0,  (38)

so that,

H w* = -b.  (39)

Consequently, equation (35) further reduces to,

d_j^T (-b - H w_1) = α_j d_j^T H d_j,  (40)

-d_j^T (b + H w_1) = α_j d_j^T H d_j.  (41)

Thus,

α_j = -d_j^T (b + H w_1) / (d_j^T H d_j).  (42)

Although equation (42) gives an expression for α_j, the right-hand side of (42) is dependent on the initial weights w_1. We can remove this dependence. Starting with equation (31),

w_j = w_1 + Σ_{i=1}^{j-1} α_i d_i,  (43)

we (once again) pre-multiply by d_j^T H,

d_j^T H w_j = d_j^T H w_1 + Σ_{i=1}^{j-1} α_i d_j^T H d_i.  (44)

Since i ≠ j for all terms in the sum on the right-hand side of equation (44), the sum is zero by H-orthogonality. Thus, equation (44) reduces to,

d_j^T H w_j = d_j^T H w_1.  (45)

Substituting (45) into (42) and letting,

g_j = b + H w_j,  (46)

equation (42) further simplifies to,

α_j = -d_j^T (b + H w_j) / (d_j^T H d_j),  (47)

α_j = -d_j^T g_j / (d_j^T H d_j).  (48)

Below, we prove that iteration (32), with step size α_j as given in (48), will converge to w* in at most W steps from any initial weight vector w_1 on a quadratic error surface.

Theorem: Assuming a W-dimensional quadratic error surface,

E(w) = E_0 + b^T w + (1/2) w^T H w,  (49)

and H-orthogonal vectors d_i, i ∈ {1, 2, ..., W}, the following weight iteration,

w_{j+1} = w_j + α_j d_j,  (50)

α_j = -d_j^T g_j / (d_j^T H d_j),  (51)

will converge in at most W steps to the minimum w* of the error surface E(w) from any initial weight vector w_1.

Proof: We will prove this theorem by showing that the gradient vector g_j at the j-th iteration is orthogonal to all previous conjugate directions d_i, i < j. From this, it follows that after W steps, the components of the gradient along all directions will have been made zero, and consequently, iteration (50) will have converged to the minimum w*.

Since,

g_j = b + H w_j,  (52)

we can write,

g_{j+1} - g_j = H (w_{j+1} - w_j).  (53)

From (50),

w_{j+1} - w_j = α_j d_j,  (54)

so that (53) reduces to,

g_{j+1} - g_j = α_j H d_j.  (55)

Not surprisingly, we now pre-multiply equation (55) by d_j^T,

d_j^T (g_{j+1} - g_j) = α_j d_j^T H d_j,  (56)

and substitute (51) for α_j, so that,

d_j^T (g_{j+1} - g_j) = -[d_j^T g_j / (d_j^T H d_j)] d_j^T H d_j,  (57)

d_j^T (g_{j+1} - g_j) = -d_j^T g_j,  (58)

d_j^T g_{j+1} - d_j^T g_j = -d_j^T g_j,  (59)

d_j^T g_{j+1} = 0.  (60)

Therefore, from the result in (60), we know that the gradient at the (j+1)-th step is orthogonal to the previous direction d_j. To show that the same is also true for all previous conjugate directions d_k, k < j, let us pre-multiply equation (55) by d_k^T,

d_k^T (g_{j+1} - g_j) = α_j d_k^T H d_j.  (61)

Since k < j, the right-hand side of equation (61) is zero by H-orthogonality. Thus, (61) reduces to,

d_k^T (g_{j+1} - g_j) = 0,  (62)

d_k^T g_{j+1} = d_k^T g_j,  k < j.  (63)

Through induction on equations (63) and (60), we can easily show that,

d_k^T g_j = 0,  ∀ k < j.  (64)

For example, let k = j - 1. From (63),

d_{j-1}^T g_{j+1} = d_{j-1}^T g_j,  (65)

and from (60),

d_{j-1}^T g_j = 0,  (66)

so that,

d_{j-1}^T g_{j+1} = 0.  (67)

Now we can let k = j - 2, j - 3, ..., and repeat the steps for k = j - 1. This completes the proof.
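Here is a short numerical check of this theorem (a sketch using arbitrary test values for H, b and w_1, not data from the notes): given mutually H-orthogonal directions and the step size (51), the iteration lands on the minimum after exactly W steps.

import numpy as np

rng = np.random.default_rng(2)
W = 5
A = rng.standard_normal((W, W))
H = A @ A.T + W * np.eye(W)               # symmetric positive-definite test Hessian
b = rng.standard_normal(W)
w_star = np.linalg.solve(H, -b)           # true minimum: H w* = -b

# Mutually H-orthogonal directions (Gram-Schmidt in the H inner product).
ds = []
for v in np.eye(W):
    d = v.copy()
    for dk in ds:
        d -= (d @ H @ dk) / (dk @ H @ dk) * dk
    ds.append(d)

w = rng.standard_normal(W)                # arbitrary initial weight vector w_1
for d in ds:
    g = b + H @ w                         # g_j = b + H w_j, equation (52)
    alpha = -(d @ g) / (d @ H @ d)        # step size, equation (51)
    w = w + alpha * d                     # weight update, equation (50)

print(np.max(np.abs(w - w_star)))         # ~0 after exactly W steps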

F. Constructing conjugate vectors

So far in our development, we have assumed a set of mutually H-orthogonal vectors {d_j}, but have not talked about how to construct these vectors. The theorem below gives one iterative method for doing just that.

Theorem: Let d_j be defined as follows:

1. Let,

d_1 = -g_1.  (68)

2. Let,

d_{j+1} = -g_{j+1} + β_j d_j,  j ≥ 1,  (69)

where,

β_j = g_{j+1}^T H d_j / (d_j^T H d_j).  (70)

The construction of (68), (69) and (70) generates W mutually H-orthogonal vectors on a W-dimensional quadratic error surface, such that,

d_j^T H d_i = 0,  i ≠ j.  (71)

Proof: First, we will show that our choice of β_j in (70) implies,

d_{j+1}^T H d_j = 0.  (72)

In other words, we will show that consecutive search directions are H-orthogonal. Pre-multiplying (69) by d_j^T H,

d_j^T H d_{j+1} = -d_j^T H g_{j+1} + β_j d_j^T H d_j,  (73)

and substituting (70) into (73), we get,

d_j^T H d_{j+1} = -d_j^T H g_{j+1} + [g_{j+1}^T H d_j / (d_j^T H d_j)] d_j^T H d_j,  (74)

d_j^T H d_{j+1} = -d_j^T H g_{j+1} + g_{j+1}^T H d_j = 0,  (75)

where the last equality holds because H is symmetric, so that d_j^T H g_{j+1} = g_{j+1}^T H d_j.

Next, we demonstrate that if,

d_j^T H d_i = 0  ⇒  d_{j+1}^T H d_i = 0,  i < j,  (76)

(⇒ means "implies"), then,

d_j^T H d_i = 0,  ∀ i ≠ j.  (77)

This can be shown by induction on (72) and (76); for example,

d_2^T H d_1 = 0  (from (72), by construction),  (78)

d_3^T H d_1 = 0  (from (78) and (76), by assumption),  (79)

d_3^T H d_2 = 0  (from (72), by construction),  (80)

d_4^T H d_1 = 0  (from (79) and (76), by assumption),  (81)

d_4^T H d_2 = 0  (from (80) and (76), by assumption),  (82)

d_4^T H d_3 = 0  (from (72), by construction), etc.  (83)

Thus, to complete the proof, we need to show that (76) is true. First, we transpose (69) and post-multiply by H d_i, i < j,

d_{j+1}^T H d_i = -g_{j+1}^T H d_i + β_j d_j^T H d_i.  (84)

Now, by assumption (76), d_j^T H d_i = 0. Hence,

d_{j+1}^T H d_i = -g_{j+1}^T H d_i.  (85)

From (32), we have that,

w_{j+1} - w_j = α_j d_j,  (86)

so that,

H (w_{j+1} - w_j) = α_j H d_j.  (87)

Since,

g_j = H w_j + b,  (88)

we also have,

H (w_{j+1} - w_j) = g_{j+1} - g_j.  (89)

Combining equations (87) and (89),

α_j H d_j = g_{j+1} - g_j,  (90)

H d_j = (1/α_j) (g_{j+1} - g_j).  (91)

We substitute equation (91) into the right-hand side of equation (85) [switching the index from j to i in equation (91)], so that,

d_{j+1}^T H d_i = -(1/α_i) g_{j+1}^T (g_{i+1} - g_i),  (92)

d_{j+1}^T H d_i = -(1/α_i) (g_{j+1}^T g_{i+1} - g_{j+1}^T g_i),  (93)

d_{j+1}^T H d_i = -(1/α_i) (g_{i+1}^T g_{j+1} - g_i^T g_{j+1}).  (94)

Now, we will show that the right-hand side of equation (94) is zero, by showing that,

g_k^T g_j = 0,  ∀ k < j.  (95)

This will complete the proof. From (68) and (69), we see that d_k, k ≥ 1, is a linear combination of all previous gradient directions,

d_k = -g_k + Σ_{l=1}^{k-1} γ_l g_l,  (96)

where the γ_l are scalar coefficients. For example,

d_1 = -g_1,  (97)

d_2 = -g_2 + β_1 d_1 = -g_2 - β_1 g_1,  (98)

d_3 = -g_3 + β_2 d_2 = -g_3 - β_2 g_2 - β_2 β_1 g_1.  (99)

We have already shown that,

d_k^T g_j = 0,  ∀ k < j,  (100)

so that if we transpose equation (96) and post-multiply by g_j, j > k,

d_k^T g_j = -g_k^T g_j + Σ_{l=1}^{k-1} γ_l g_l^T g_j,  (101)

0 = -g_k^T g_j + Σ_{l=1}^{k-1} γ_l g_l^T g_j  [from (100)],  (102)

g_k^T g_j = Σ_{l=1}^{k-1} γ_l g_l^T g_j.  (103)

Since d_1 = -g_1,

d_1^T g_j = -g_1^T g_j = 0,  j > 1  [from (100)],  (104)

g_1^T g_j = 0,  j > 1.  (105)

By induction on equations (103) and (105), we can now show that (95) is true; for example,

g_1^T g_2 = 0  [from (105)],  (106)

g_2^T g_3 = γ_1 g_1^T g_3 = 0  [from (103) and (105)],  (107)

g_1^T g_4 = 0  [from (105)],  (108)

g_2^T g_4 = γ_1 g_1^T g_4 = 0  [from (103) and (105)],  (109)

g_3^T g_4 = γ_1 g_1^T g_4 + γ_2 g_2^T g_4 = 0  [from (103), (108) and (109)], etc.  (110)

Thus,

g_k^T g_j = 0,  ∀ k < j,  (111)

so that,

d_{j+1}^T H d_i = -(1/α_i) (g_{i+1}^T g_{j+1} - g_i^T g_{j+1}) = 0,  i < j.  (112)

Recall that equation (112) presupposes d_j^T H d_i = 0, so that we have shown equation (76) to be true; i.e.,

d_{j+1}^T H d_i = 0 when d_j^T H d_i = 0,  i < j.  (113)

Thus, by induction on equations (72) and (76), the construction in (68), (69) and (70) generates mutually H-orthogonal vectors on a quadratic error surface, such that,

d_j^T H d_i = 0,  i ≠ j,  (114)

and the proof is complete.
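The construction just proved can be verified numerically. The following sketch (arbitrary quadratic test values, not from the notes) generates directions with (68), (69) and the Hessian-based β of (70), taking exact steps (51) along each, and confirms that all pairs of directions are H-orthogonal.

import numpy as np

rng = np.random.default_rng(3)
W = 6
A = rng.standard_normal((W, W))
H = A @ A.T + W * np.eye(W)               # symmetric positive-definite test Hessian
b = rng.standard_normal(W)
grad = lambda w: b + H @ w

w = rng.standard_normal(W)
g = grad(w)
d = -g                                    # equation (68): d_1 = -g_1
D = [d]
for _ in range(W - 1):
    alpha = -(d @ g) / (d @ H @ d)        # exact step size, equation (51)
    w = w + alpha * d
    g = grad(w)                           # g_{j+1}
    beta = (g @ H @ d) / (d @ H @ d)      # equation (70)
    d = -g + beta * d                     # equation (69)
    D.append(d)

G = np.array([[di @ H @ dj for dj in D] for di in D])
print(np.max(np.abs(G - np.diag(np.diag(G)))))   # off-diagonal ~0: mutually H-orthogonal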

G. Computing β_j without the Hessian

We now have the following algorithm for training the weights of a neural network:

1. Let,

d_1 = -g_1,  (115)

where w_1 is a random initial weight vector.

2. For j ≥ 1, let,

w_{j+1} = w_j + α_j d_j,  (116)

where,

α_j = -d_j^T g_j / (d_j^T H d_j).  (117)

3. Let,

d_{j+1} = -g_{j+1} + β_j d_j,  (118)

where,

β_j = g_{j+1}^T H d_j / (d_j^T H d_j).  (119)

4. Iterate steps 2 and 3.

Ideally, we would like to compute both α_j and β_j without explicit computation of H. Let us look at β_j first. Substituting equation (91),

H d_j = (1/α_j) (g_{j+1} - g_j),  (120)

into equation (119), we get,

β_j = g_{j+1}^T (g_{j+1} - g_j) / [d_j^T (g_{j+1} - g_j)],  (121)

which is known as the Hestenes-Stiefel expression for β_j.
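As a quick numerical sanity check (a sketch with arbitrary quadratic test values, not data from the notes), the Hessian-free Hestenes-Stiefel expression (121) gives the same value as the Hessian-based expression (119) when the step along d_j is an exact line minimization:

import numpy as np

rng = np.random.default_rng(4)
W = 4
A = rng.standard_normal((W, W))
H = A @ A.T + W * np.eye(W)               # arbitrary quadratic test surface
b = rng.standard_normal(W)
grad = lambda w: b + H @ w

w = rng.standard_normal(W)
g = grad(w)
d = -g
alpha = -(d @ g) / (d @ H @ d)            # exact line minimization along d
g_new = grad(w + alpha * d)

beta_hessian = (g_new @ H @ d) / (d @ H @ d)            # equation (119)
beta_hs = (g_new @ (g_new - g)) / (d @ (g_new - g))     # Hestenes-Stiefel, equation (121)
print(beta_hessian, beta_hs)              # agree up to round-off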

Expression (121) can be further simplified. From equation (118),

d_j = -g_j + β_{j-1} d_{j-1}.  (122)

Then, transposing equation (122) and post-multiplying by g_j, we get,

d_j^T g_j = -g_j^T g_j + β_{j-1} d_{j-1}^T g_j.  (123)

Since,

d_k^T g_j = 0,  ∀ k < j,  (124)

equation (123) reduces to,

d_j^T g_j = -g_j^T g_j,  (125)

g_j^T g_j = -d_j^T g_j.  (126)

Consequently [using also d_j^T g_{j+1} = 0, from (60)], equation (121) reduces to,

β_j = g_{j+1}^T (g_{j+1} - g_j) / (g_j^T g_j),  (127)

which is known as the Polak-Ribiere expression for β_j. Finally, since,

g_k^T g_j = 0,  ∀ k < j,  (128)

equation (127) further reduces to,

β_j = g_{j+1}^T g_{j+1} / (g_j^T g_j),  (129)

which is known as the Fletcher-Reeves expression for β_j.

Note that none of the three β_j expressions (equations (121), (127) and (129), respectively) requires computation of the local Hessian H. Also, note that on a purely quadratic error surface, the three expressions above are identical. Since a neural network error function is typically only approximately quadratic, however, the different expressions for β_j will give slightly different results. According to [1], the Polak-Ribiere form in (127) usually gives slightly better results in practice than the other two expressions. With (127), when the algorithm is making poor progress, β_j will be approximately zero, so that the search direction will be reset to the negative gradient, effectively re-initializing or restarting the conjugate gradient algorithm.
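Continuing the numerical check above (same kind of arbitrary quadratic test values, not data from the notes), all three Hessian-free expressions coincide after an exact line minimization on a quadratic surface; on a non-quadratic surface they generally differ, which is where the practical distinctions mentioned above arise.

import numpy as np

rng = np.random.default_rng(5)
W = 4
A = rng.standard_normal((W, W))
H = A @ A.T + W * np.eye(W)               # arbitrary quadratic test surface
b = rng.standard_normal(W)
grad = lambda w: b + H @ w

w = rng.standard_normal(W)
g = grad(w)
d = -g                                    # first search direction d_1 = -g_1
alpha = -(d @ g) / (d @ H @ d)            # exact line minimization
g_new = grad(w + alpha * d)

beta_hs = (g_new @ (g_new - g)) / (d @ (g_new - g))   # Hestenes-Stiefel (121)
beta_pr = (g_new @ (g_new - g)) / (g @ g)             # Polak-Ribiere   (127)
beta_fr = (g_new @ g_new) / (g @ g)                   # Fletcher-Reeves (129)
print(beta_hs, beta_pr, beta_fr)          # identical on a quadratic surface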

H. Computing α_j without the Hessian

As we will show below, rather than compute α_j (the step size along each search direction) through equation (51), repeated here as,

α_j = -d_j^T g_j / (d_j^T H d_j),  (130)

we can instead determine α_j through a line minimization along d_j. We will show below that for a quadratic error surface, this line minimization is equivalent to computing α_j explicitly. For a quadratic error surface,

E(w) = E_0 + b^T w + (1/2) w^T H w,  (131)

we have that,

E(w_j + α_j d_j) = E_0 + b^T (w_j + α_j d_j) + (1/2) (w_j + α_j d_j)^T H (w_j + α_j d_j).  (132)

We now differentiate equation (132) with respect to α_j and equate to zero:

∂E(w_j + α_j d_j) / ∂α_j = b^T d_j + (1/2) d_j^T H (w_j + α_j d_j) + (1/2) (w_j + α_j d_j)^T H d_j = 0.  (133)

Since H is symmetric,

d_j^T H (w_j + α_j d_j) = (w_j + α_j d_j)^T H d_j,  (134)

equation (133) reduces to,

b^T d_j + d_j^T H (w_j + α_j d_j) = 0,  (135)

d_j^T b + d_j^T H w_j + α_j d_j^T H d_j = 0,  (136)

α_j d_j^T H d_j = -d_j^T (b + H w_j).  (137)

Since,

g_j = b + H w_j,  (138)

equation (137) reduces to,

α_j d_j^T H d_j = -d_j^T g_j,  (139)

α_j = -d_j^T g_j / (d_j^T H d_j).  (140)

Note that equation (140) is equivalent to equation (130), which was derived independently from theoretical considerations. Therefore, we have shown that we may safely replace explicit computation of α_j with a line minimization along d_j.

3. The conjugate gradient algorithm

We have now arrived at the complete conjugate gradient algorithm, which we summarize below:

1. Choose an initial weight vector w_1 (e.g. a vector of small, random weights).

2. Evaluate g_1 = ∇E[w_1] and let,

d_1 = -g_1.  (141)

3. At step j, j ≥ 1, perform a line minimization along d_j such that,

E(w_j + α_j d_j) ≤ E(w_j + α d_j),  ∀α.  (142)

4. Let,

w_{j+1} = w_j + α_j d_j.  (143)

5. Evaluate g_{j+1}.

6. Let,

d_{j+1} = -g_{j+1} + β_j d_j,  (144)

where,

β_j = g_{j+1}^T (g_{j+1} - g_j) / (g_j^T g_j).  (145)

7. Let j = j + 1 and go to step 3.

Note that at no time does the above algorithm require explicit computation of the Hessian H. In fact, the conjugate gradient algorithm requires very little additional computation compared to the steepest descent algorithm. It does, however, exhibit significantly faster convergence properties. In fact, on a W-dimensional quadratic error surface, the conjugate gradient algorithm is guaranteed to converge to the minimum in at most W steps; no such guarantee exists for the steepest descent algorithm. The main difference between the two algorithms is that the conjugate gradient algorithm exploits a reasonable assumption about the error surface (i.e. locally quadratic) to achieve faster convergence.

A neural network error surface will in general, of course, not be globally quadratic. Therefore, in practice, more than W steps will usually be required before convergence to a local minimum. Consequently, in applying the conjugate gradient algorithm to neural network training, we typically reset the algorithm every W steps; that is, we reset the search direction to the negative gradient [equation (141)]. Finally, we note that throughout our development we have assumed that the Hessian H is positive definite. When H is not positive definite, the line minimization in step 3 of the algorithm prevents the algorithm from making negative progress (i.e. increasing the error of the neural network).
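To make the summary concrete, here is a minimal NumPy sketch of the complete algorithm (not code from the notes). It assumes only that the caller supplies the error function E and its gradient; the line minimization uses a simple bracketing plus golden-section search, β is the Polak-Ribiere form (145), the search direction is reset to the negative gradient every W steps, and iteration stops when E falls below a tolerance (appropriate when the minimum error is essentially zero, as in the examples that follow). The function names and tolerances are my own choices.

import numpy as np

def golden_line_search(f, a=0.0, b=1.0, tol=1e-10, max_iter=200):
    # Minimize the scalar function f(alpha) over [a, b] by golden-section search.
    phi = (np.sqrt(5.0) - 1.0) / 2.0
    x1, x2 = b - phi * (b - a), a + phi * (b - a)
    f1, f2 = f(x1), f(x2)
    for _ in range(max_iter):
        if b - a < tol:
            break
        if f1 > f2:
            a, x1, f1 = x1, x2, f2
            x2 = a + phi * (b - a)
            f2 = f(x2)
        else:
            b, x2, f2 = x2, x1, f1
            x1 = b - phi * (b - a)
            f1 = f(x1)
    return 0.5 * (a + b)

def conjugate_gradient_train(E, grad, w1, n_steps=1000, tol=1e-6):
    # Steps 1-7 of the algorithm: Polak-Ribiere beta, restart every W steps.
    w = np.asarray(w1, dtype=float)
    W = w.size
    g = grad(w)
    d = -g                                          # step 2: d_1 = -g_1
    for j in range(n_steps):
        hi = 1.0                                    # step 3: bracket, then line-minimize
        for _ in range(60):
            if E(w + 2.0 * hi * d) >= E(w + hi * d):
                break
            hi *= 2.0
        alpha = golden_line_search(lambda a: E(w + a * d), 0.0, 2.0 * hi)
        w = w + alpha * d                           # step 4
        g_new = grad(w)                             # step 5
        if E(w) < tol:
            return w, j + 1
        beta = g_new @ (g_new - g) / (g @ g)        # step 6: Polak-Ribiere (145)
        if (j + 1) % W == 0:
            beta = 0.0                              # periodic restart to -g
        d = -g_new + beta * d
        g = g_new
    return w, n_steps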

4. Examples

In this section, we will examine the conjugate gradient algorithm for some simple examples, including (1) a simple quadratic error surface, (2) a simple non-quadratic error surface, and (3) a simple neural network training example. We will see that the conjugate gradient algorithm significantly outperforms both the steepest descent and gradient descent algorithms in speed of convergence.

A. Quadratic example

Here we consider the simple quadratic error surface,

E = 2ω_1² + ω_2².  (146)

For this error surface and from a fixed initial weight vector (ω_1, ω_2), Figure 1 below plots the trajectory in weight space of (1) the conjugate gradient algorithm, (2) the steepest descent algorithm, and (3) the gradient descent algorithm with a learning rate of η = 0.4. Each trajectory is superimposed on top of a contour plot of E and is stopped when E < 10^-6. Note that the minimum for E occurs at (ω_1, ω_2) = (0, 0).

[Figure 1: weight-space trajectories of the conjugate gradient, steepest descent, and gradient descent (η = 0.4) algorithms on the quadratic error surface (146), superimposed on a contour plot of E.]

From Figure 1, we make several observations. First, as expected, the conjugate gradient algorithm does indeed converge in two steps for a two-dimensional quadratic error surface. By contrast, the steepest descent algorithm requires 5 steps to converge to an error E < 10^-6, even though the two algorithms require approximately the same amount of computation per step. The gradient descent algorithm comes in a distant third at 75 steps to convergence for a learning rate of η = 0.4, and this does not count the amount of computation performed in searching for the near-optimal value of the learning rate η. In fact, one of the nicest features of the conjugate gradient algorithm is that it requires no hand-selection of any learning parameter, momentum term, etc.

A legitimate complaint about Figure 1 is that the conjugate gradient algorithm is specifically designed for a quadratic error surface, so that its superior performance in this case is not surprising; in other words, the skeptical reader might easily conclude that the comparison in Figure 1 is rigged. Therefore, we consider a decidedly non-quadratic error surface in the following section.
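For reference, the two-step behavior claimed for the quadratic example can be reproduced with a few lines of NumPy. This is a self-contained sketch: the surface coefficients in (146) and the starting point are reconstructions/illustrative choices, and exact analytic step sizes stand in for the line minimization.

import numpy as np

H = np.array([[4.0, 0.0], [0.0, 2.0]])    # Hessian of the (reconstructed) surface (146)
E = lambda w: 0.5 * w @ H @ w             # E = 2*w1^2 + w2^2
grad = lambda w: H @ w

w = np.array([1.0, 1.0])                  # an illustrative starting point
g = grad(w)
d = -g
for step in range(2):                     # two conjugate gradient steps in 2-D
    alpha = -(d @ g) / (d @ H @ d)        # exact line minimization on a quadratic
    w = w + alpha * d
    g_new = grad(w)
    beta = (g_new @ g_new) / (g @ g)      # Fletcher-Reeves; identical to PR here
    d = -g_new + beta * d
    g = g_new
print(E(w), w)                            # ~0 and ~(0, 0): converged in two steps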

B. Non-quadratic example

Here we consider the simple non-quadratic error surface,

E = 1 - exp(-5ω_1² - ω_2²),  (147)

which is plotted in Figure 2.

[Figure 2: surface plot of the non-quadratic error surface E of equation (147).]

This error surface is characteristic of some regions in weight space for neural network training problems, in that for certain weights the gradient is close to zero. For this error surface, and from a particular initial weight vector (ω_1, ω_2), Figure 3 below plots the trajectory in weight space of (1) the conjugate gradient algorithm, (2) the steepest descent algorithm, and (3) the gradient descent algorithm with a fixed learning rate η. Each trajectory is superimposed on top of a contour plot of E and is stopped when E < 10^-6. Note that the minimum for E occurs at (ω_1, ω_2) = (0, 0).

[Figure 3: weight-space trajectories of the conjugate gradient, steepest descent, and gradient descent algorithms on the non-quadratic error surface, superimposed on a contour plot of E.]

From Figure 3, we make several observations. First, the conjugate gradient algorithm no longer converges in just two steps (for a two-dimensional surface), since E in (147) is not quadratic; it does, however, converge in four short steps to E < 10^-6. By contrast, the steepest descent algorithm requires 5 steps to converge to an error E < 10^-6, and the gradient descent algorithm comes in a distant third at 6 steps to convergence; once again, the 6 steps for gradient descent do not account for the computation performed in searching for the near-optimal value of the learning rate η.

The observant reader may still be skeptical. Why, you ask, did we choose that particular initial weight vector for this problem? Caught once again! It turns out that these particular initial weights fall just inside the region of the weight space where the Hessian H is positive definite. The region in Figure 4 encircled by the heavy line corresponds to that region in weight space where H > 0. Note that the initial weights fall just inside that region.

[Figure 4: the region of weight space (encircled by the heavy line) in which the Hessian H of the non-quadratic error surface is positive definite.]

So how does the conjugate gradient algorithm fare when H is not positive definite? Figure 5 answers this question. In Figure 5, we initialize the weights outside the positive-definite region, and once again track the conjugate gradient, steepest descent and gradient descent algorithms. Note that once again, the conjugate gradient algorithm clearly outperforms its two competitors.

[Figure 5: weight-space trajectories of the conjugate gradient, steepest descent, and gradient descent algorithms, starting from initial weights outside the positive-definite region.]
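The positive-definite region discussed above can be mapped numerically. The sketch below uses the reconstructed form of (147) (the exact coefficients in the original notes may differ) and a finite-difference Hessian to test a few sample points; scanning a grid of points in the same way traces out a region like the one sketched in Figure 4.

import numpy as np

E = lambda w: 1.0 - np.exp(-5.0 * w[0]**2 - w[1]**2)   # assumed form of (147)

def hessian(E, w, eps=1e-5):
    # Central finite-difference Hessian of a scalar function E at w.
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = eps, eps
            H[i, j] = (E(w + ei + ej) - E(w + ei - ej)
                       - E(w - ei + ej) + E(w - ei - ej)) / (4.0 * eps**2)
    return H

for w in [np.array([0.0, 0.0]), np.array([0.1, 0.5]), np.array([1.0, 1.0])]:
    lam = np.linalg.eigvalsh(hessian(E, w))
    print(w, lam, "H > 0" if np.all(lam > 0) else "H not positive definite")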

In fact, the gradient descent algorithm degenerates to an awful 94 steps to convergence; this poor performance is typical of gradient descent in flat regions of the error surface.

C. Neural-network training example

Finally, we look at a simple neural network training example. Here we are interested in approximating,

y = sin(πx),  x ∈ [0, 1],  (see Figure 6)  (148)

with a single-layer, three-hidden-unit neural network (see Figure 6) and symmetric sigmoidal activation functions,

γ(u) = 2 / (1 + e^-u) - 1.  (149)

Our training data for this problem consists of N = 5 evenly spaced (x, y) points sampled from x ∈ [0, 1]. (Due to an oversight, no connection from the bias unit to the output unit was included in this network; thus, there are a total of nine trainable weights in this neural network.)

[Figure 6: the target function y = sin(πx) on x ∈ [0, 1], and the single-hidden-layer network used to approximate it.]
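The training setup can be expressed as a small error function plus gradient, ready to hand to a conjugate gradient trainer such as the sketch given after the algorithm summary in Section 3. This is only an assumed reconstruction: the exact sigmoid, the number of training points, and the weight layout are not recoverable from the notes, and the finite-difference gradient here is a stand-in for backpropagation.

import numpy as np

gamma = lambda u: 2.0 / (1.0 + np.exp(-u)) - 1.0     # assumed symmetric sigmoid (149)

x = np.linspace(0.0, 1.0, 5)                         # assumed N evenly spaced inputs
y = np.sin(np.pi * x)                                # targets from (148)

def network(w, x):
    # 1-3-1 network, no output bias: w = [input->hidden (3), hidden biases (3), hidden->output (3)].
    w_ih, b_h, w_ho = w[0:3], w[3:6], w[6:9]
    h = gamma(np.outer(x, w_ih) + b_h)               # hidden outputs, shape (N, 3)
    return h @ w_ho                                  # network output z

def error(w):
    return 0.5 * np.sum((network(w, x) - y) ** 2)    # sum-of-squares error E

def error_grad(w, eps=1e-6):
    # Central-difference gradient; a real implementation would use backpropagation.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (error(w + e) - error(w - e)) / (2.0 * eps)
    return g

w0 = 0.1 * np.random.default_rng(0).standard_normal(9)   # nine trainable weights
print(error(w0), np.linalg.norm(error_grad(w0)))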

Figure 7 plots,

log(E / N),  (150)

as a function of the number of steps (or epochs) of the conjugate gradient training algorithm. By comparison, Figure 8 illustrates that steepest descent does not come close to approaching the conjugate gradient algorithm's performance, even after far more training steps. Finally, we note that gradient descent (with η = 0.3) requires many times more epochs to reach approximately the same error as the conjugate gradient algorithm, as shown in Figure 9. Once again, the conjugate gradient algorithm converges in many fewer steps than either of its two competitors.

D. Postscript

Let us look at two more issues with regard to the neural network training example of the previous section. First, Figure 10 plots the neural network's progress in approximating y as the number of training epochs (steps) increases. Note how the conjugate gradient algorithm slowly begins to bend an initially linear approximation to ultimately generate a very good fit of the sine curve.

Second, it is interesting to examine how this neural network, with sigmoidal activation functions centered about the origin [see equation (149)], is able to approximate the function y (with non-zero average value) without a bias connection to the output unit z. Let us denote z_i, i ∈ {1, 2, 3}, as the net contribution from hidden unit i to the output z, so that,

z = z_1 + z_2 + z_3.  (151)

In other words,

z_i = ω_i h_i,  i ∈ {1, 2, 3},  (152)

where ω_i denotes the weight from the i-th hidden unit to the output, and h_i denotes the output of hidden unit i. Figure 11 (top) plots each of the z_i, which are easily recognized as different parts of the symmetric sigmoid function in equation (149). To understand how they combine to form an approximation of y, we separately plot z_1 + z_2 and z_3 in Figure 11 (bottom). Note that z_3 is primarily responsible for the overall shape of the z function, while z_1 + z_2 together pull the ends of z to conform to the familiar sine curve shape.

5. Conclusion

We have derived the conjugate gradient algorithm and illustrated its properties on some simple examples. We have shown that it outperforms both the gradient descent and the steepest descent algorithms in terms of the number of computations required to reach convergence. This improvement has been achieved primarily by accounting for the local properties of error surfaces, allowing us to employ gradient information in a more intelligent manner; furthermore, the conjugate gradient algorithm makes unnecessary the often painful trial-and-error selection of tunable learning parameters.

[1] C. M. Bishop, "Parameter Optimization Algorithms," Chapter 7 in Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.

[Figure 7: log(E/N) versus the number of training epochs for the conjugate gradient algorithm.]

[Figure 8: log(E/N) versus the number of training epochs for the steepest descent algorithm, compared with the conjugate gradient algorithm.]

[Figure 9: log(E/N) versus the number of training epochs for the gradient descent algorithm, compared with the conjugate gradient algorithm.]

[Figure 10: snapshots of the neural network's approximation of y after an increasing number of conjugate gradient training epochs.]

[Figure 11: (top) the individual hidden-unit contributions z_1, z_2 and z_3 to the network output, as functions of x; (bottom) z_1 + z_2 and z_3, as functions of x.]
