Conjugate gradient algorithm for training neural networks
1. Introduction

Recall that in the steepest-descent neural network training algorithm, consecutive line-search directions are orthogonal:

    g[w(t+1)]^T d(t) = 0,    (1)

where g[w(t+1)] denotes ∇E[w(t+1)], the gradient of the error E with respect to the weights w(t+1) at step (t+1) of the training algorithm, and d(t) denotes the line-search direction at step t of the training algorithm. In steepest descent, of course,

    d(t) = -g(t),    (2)

so that,

    w(t+1) = w(t) - η(t) g(t),    (3)

where η(t) is determined at each step through line minimization.

2. Deriving the conjugate gradient algorithm

A. Introduction

As we have seen, choosing consecutive line-search directions as in equation (2) can lead to oscillatory behavior in the training algorithm, even for simple quadratic error surfaces. This can significantly slow down convergence of the algorithm. To combat this, we should ideally choose consecutive line-search directions that are non-interfering; in other words, consecutive search directions that do not undo the progress of previous search directions. Suppose we have executed a line search along direction d(t) at time step t; then the next search direction d(t+1) should be chosen so that,

    g[w(t+1) + η d(t+1)]^T d(t) = 0, ∀η.    (4)

Condition (4) explicitly preserves the progress made in the previous line search (see the corresponding figure in Chapter 7 of [1]), and will guide us through our development of the conjugate gradient training algorithm. Below, we closely follow the derivation in [1], but provide somewhat greater detail in certain parts of the derivation.

B. First step

Lemma: Condition (4) implies, to first order,

    d(t+1)^T H d(t) = 0,    (5)

where H is the Hessian matrix with elements H(i,j),

    H(i,j) = ∂²E / (∂ω_i ∂ω_j),    (6)

evaluated at w(t+1), where,

    w = [ω_1, ω_2, …, ω_W]^T.    (7)
Proof: First, let us approximate g(w) by its first-order Taylor expansion about w(t+1),

    g(w)^T ≈ g[w(t+1)]^T + [w - w(t+1)]^T ∇{g[w(t+1)]},    (8)

and evaluate the right-hand side of equation (8) at w = w(t+1) + η d(t+1):

    g[w(t+1) + η d(t+1)]^T ≈ g[w(t+1)]^T + η d(t+1)^T H.    (9)

Note from equation (9) that,
    H = ∇∇E[w(t+1)] = ∇{g[w(t+1)]}.    (10)

Substituting equation (9) into (4),

    {g[w(t+1)]^T + η d(t+1)^T H} d(t) = 0.    (11)

From (11),

    g[w(t+1)]^T d(t) + η d(t+1)^T H d(t) = 0.    (12)

By the orthogonality condition (1),

    g[w(t+1)]^T d(t) = 0,    (13)

so that equation (12) reduces to,

    η d(t+1)^T H d(t) = 0.    (14)

Equation (14) implies either η = 0 and/or d(t+1)^T H d(t) = 0. We can dismiss the first of these implications (η = 0), since a zero learning parameter would mean that we are not changing the weights. Therefore, (5) follows from (14) and the proof is complete.

Note that for quadratic error surfaces, the first-order Taylor approximation in (8) is exact. For non-quadratic error surfaces, it is approximately correct in the locally quadratic neighborhood of the weight vector w(t+1). Search directions that meet condition (5) are called H-orthogonal, or conjugate with respect to H. Our overall goal in developing the conjugate gradient algorithm is to construct consecutive search directions d(t) that meet condition (5) without explicit computation of the large Hessian matrix H. Note that an algorithm that uses H-orthogonal search directions will converge to the minimum of the error surface (for quadratic or locally quadratic error surfaces) in at most W steps, where W is the number of weights in the neural network.

C. Linear algebra preliminaries

In our derivation of the conjugate gradient algorithm, we will assume a locally quadratic error surface of the following form:

    E(w) ≈ E_0 + b^T w + (1/2) w^T H w,    (15)

where E_0 is a constant scalar, b is a constant vector, and H is a constant matrix (i.e., the Hessian). Furthermore, we assume that H is symmetric and positive definite, a property that we will sometimes denote by H > 0.

Definition: A square matrix H is positive definite if and only if all its eigenvalues λ_i are greater than zero. If a matrix is positive definite, then,

    v^T H v > 0, ∀v ≠ 0.    (16)

Note that for a quadratic error surface, H is constant; if it is positive definite at any point, it is therefore positive definite everywhere.
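The definition above is easy to check numerically for a small matrix. The following sketch (illustrative code, not part of the original notes) computes the eigenvalues of a symmetric 2x2 matrix in closed form and spot-checks condition (16) on random vectors:

```python
import math
import random

# An illustrative symmetric 2x2 matrix (chosen to match the two-weight
# quadratic surface used in these notes).
H = [[2.0, 0.0], [0.0, 4.0]]

def eigenvalues_sym2x2(m):
    """Closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

lam1, lam2 = eigenvalues_sym2x2(H)
print(lam1, lam2)  # 4.0 2.0 -- both positive, so H is positive definite

# Positive definiteness also means v^T H v > 0 for every nonzero v (eq. 16).
random.seed(0)
for _ in range(1000):
    v = [random.uniform(-1, 1), random.uniform(-1, 1)]
    quad = sum(v[i] * H[i][j] * v[j] for i in range(2) for j in range(2))
    assert quad > 0.0
```

The eigenvalue test and the quadratic-form test are equivalent for symmetric matrices; the random sampling merely illustrates (16), it does not prove it.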
Consider, for example, the following simple quadratic error surface of two weights,

    E = ω_1² + 2ω_2².    (17)

For (17), we have previously computed the Hessian to be,

    H = [2  0]
        [0  4],    (18)

with eigenvalues λ_1 = 4 and λ_2 = 2. Hence, H is positive definite. Note also that for any error function with continuous second partial derivatives, the Hessian is guaranteed to be symmetric, since,
    ∂²E/(∂ω_i ∂ω_j) = ∂²E/(∂ω_j ∂ω_i).    (19)

Theorem: For a positive-definite square matrix H, H-orthogonal vectors {d_1, d_2, …, d_k} are linearly independent.

Proof: For linear independence, we must show that,

    α_1 d_1 + α_2 d_2 + … + α_k d_k = 0    (20)

if and only if α_i = 0, ∀i. In equation (20), 0 denotes a vector with the same dimensionality as the d_i and all zero elements. Let us first pre-multiply equation (20) by d_i^T H:

    d_i^T H (α_1 d_1 + α_2 d_2 + … + α_k d_k) = d_i^T H 0 = 0,    (21)
    α_1 d_i^T H d_1 + α_2 d_i^T H d_2 + … + α_k d_i^T H d_k = 0.    (22)

Note that since we have assumed that the vectors d_i, i ∈ {1, 2, …, k}, are H-orthogonal, i.e.,

    d_i^T H d_j = 0, i ≠ j,    (23)

all terms in (22) drop out except the α_i d_i^T H d_i term, so that (22) reduces to,

    α_i d_i^T H d_i = 0.    (24)

Since H is positive definite, however, we know from (16) that,

    d_i^T H d_i > 0,    (25)

so that in order for (24) to be true,

    α_i = 0, ∀i ∈ {1, 2, …, k},    (26)

and the proof is complete.

From this theorem, it follows directly that in a W-dimensional space, W H-orthogonal vectors d_i, i ∈ {1, 2, …, W}, form a complete, albeit non-orthogonal, basis set. This means that any vector v in that space may be expressed as a linear combination of the d_i vectors, i.e.,

    v = Σ_{i=1}^{W} α_i d_i.    (27)

D. The setup

This section establishes the foundations of the conjugate gradient algorithm. In our derivation here and in subsequent sections, we will assume a locally quadratic error surface of form (15),

    E(w) ≈ E_0 + b^T w + (1/2) w^T H w.    (28)

For now, let us also assume a set of H-orthogonal (or conjugate) vectors d_i, i ∈ {1, 2, …, W}. Given these conjugate vectors, let us see how we can move from some initial weight vector w_1 to the minimum w* of error surface (28). From (27), we can express the difference between w* and w_1 as,

    w* - w_1 = Σ_{i=1}^{W} α_i d_i,    (29)

so that,
    w* = w_1 + Σ_{i=1}^{W} α_i d_i.    (30)

Then, if we define,

    w_j = w_1 + Σ_{i=1}^{j-1} α_i d_i,    (31)

we can rewrite equation (30) recursively as,

    w_{j+1} = w_j + α_j d_j.    (32)

Note that equation (32) represents a sequence of steps in weight space from w_1 to w*. These steps are parallel to the conjugate directions d_j, and the length of each step is controlled by α_j. Equation (32) forms the basis of the conjugate gradient algorithm for training neural networks and is similar to previous training algorithms that we have studied in that it offers a recursive minimization procedure for the weights w. The big difference from previous algorithms is, of course, that d_j ≠ -g_j, as was the case for the gradient and steepest descent algorithms. Also, the step size α_j is not held constant, as is typical in gradient descent.

At this point, several big questions remain unanswered, however. First, how do we construct the d_j vectors given an error surface? Second, how do we determine the corresponding step sizes α_j such that we are guaranteed convergence in at most W steps on a locally quadratic error surface? Finally, how can we do both without explicit computation of the Hessian H? Below, we answer these questions one by one.

E. Computing correct step sizes

Let us first attempt to find an expression for α_j, assuming that we have a set of W conjugate vectors d_i, i ∈ {1, 2, …, W}. To do this, let us pre-multiply equation (29) by d_j^T H:

    d_j^T H (w* - w_1) = d_j^T H Σ_{i=1}^{W} α_i d_i,    (33)
    d_j^T H (w* - w_1) = Σ_{i=1}^{W} α_i d_j^T H d_i.    (34)

By H-orthogonality, all the terms in the right-hand sum of equation (34) where i ≠ j are zero. Thus, equation (34) reduces to,

    d_j^T H (w* - w_1) = α_j d_j^T H d_j.    (35)

Now, note that since the gradient of (28) is given by,

    g(w) = b + H w,    (36)

the minimum w* can be expressed implicitly as,

    g(w*) = 0,    (37)

so that,

    b + H w* = 0,    (38)
    H w* = -b.    (39)

Consequently, equation (35) further reduces to,

    d_j^T (-b - H w_1) = α_j d_j^T H d_j,    (40)
    -d_j^T (b + H w_1) = α_j d_j^T H d_j.    (41)

Thus,

    α_j = -d_j^T (b + H w_1) / (d_j^T H d_j).    (42)

Although equation (42) gives an expression for α_j, the right-hand side of (42) depends on the initial weights w_1. We can remove this dependence. Starting with equation (31),

    w_j = w_1 + Σ_{i=1}^{j-1} α_i d_i,    (43)

we (once again) pre-multiply by d_j^T H,

    d_j^T H w_j = d_j^T H w_1 + Σ_{i=1}^{j-1} α_i d_j^T H d_i.    (44)

Since i ≠ j for all terms in the sum on the right-hand side of equation (44), the sum is zero by H-orthogonality. Thus, equation (44) reduces to,

    d_j^T H w_j = d_j^T H w_1.    (45)

Substituting (45) into (42) and letting,

    g_j = b + H w_j,    (46)

equation (42) further simplifies to,

    α_j = -d_j^T (b + H w_j) / (d_j^T H d_j),    (47)
    α_j = -d_j^T g_j / (d_j^T H d_j).    (48)

Below, we prove that iteration (32), with step size α_j as given in (48), will converge to w* in at most W steps from any initial weight vector w_1 on a quadratic error surface.

Theorem: Assuming a W-dimensional quadratic error surface,

    E(w) = E_0 + b^T w + (1/2) w^T H w,    (49)

and H-orthogonal vectors d_i, i ∈ {1, 2, …, W}, the following weight iteration,

    w_{j+1} = w_j + α_j d_j,    (50)
    α_j = -d_j^T g_j / (d_j^T H d_j),    (51)

will converge in at most W steps to the minimum w* of the error surface E(w) from any initial weight vector w_1.

Proof: We will prove this theorem by showing that the gradient vector g_j at the j-th iteration is orthogonal to all previous conjugate directions d_i, i < j. From this, it follows that after W steps, the components of the gradient along all W (linearly independent) directions will have been made zero, and consequently, iteration (50) will have converged to the minimum w*.
Since,

    g_j = b + H w_j,    (52)

we can write,

    g_{j+1} - g_j = H (w_{j+1} - w_j).    (53)

From (50),

    w_{j+1} - w_j = α_j d_j,    (54)

so that (53) reduces to,

    g_{j+1} - g_j = α_j H d_j.    (55)

Not surprisingly, we now pre-multiply equation (55) by d_j^T,

    d_j^T (g_{j+1} - g_j) = α_j d_j^T H d_j,    (56)

and substitute (51) for α_j, so that,

    d_j^T (g_{j+1} - g_j) = -(d_j^T g_j / (d_j^T H d_j)) d_j^T H d_j,    (57)
    d_j^T (g_{j+1} - g_j) = -d_j^T g_j,    (58)
    d_j^T g_{j+1} - d_j^T g_j = -d_j^T g_j,    (59)
    d_j^T g_{j+1} = 0.    (60)

Therefore, from the result in (60), we know that the gradient at the (j+1)-th step is orthogonal to the previous direction d_j. To show that the same is also true for all previous conjugate directions d_k, k < j, let us pre-multiply equation (55) by d_k^T,

    d_k^T (g_{j+1} - g_j) = α_j d_k^T H d_j.    (61)

Since k < j, the right-hand side of equation (61) is zero by H-orthogonality. Thus, (61) reduces to,

    d_k^T (g_{j+1} - g_j) = 0,    (62)
    d_k^T g_{j+1} = d_k^T g_j, k < j.    (63)

By induction on equations (63) and (60), we can easily show that,

    d_k^T g_j = 0, ∀k < j.    (64)

For example, let k = j - 1. From (63),

    d_{j-1}^T g_{j+1} = d_{j-1}^T g_j,    (65)

and from (60),

    d_{j-1}^T g_j = 0,    (66)

so that,

    d_{j-1}^T g_{j+1} = 0.    (67)

Now we can let k = j-2, j-3, …, and repeat the steps for k = j-1. This completes the proof.
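The theorem can be exercised directly on the two-weight quadratic surface E = ω_1² + 2ω_2² from the earlier example. In the sketch below, the starting point and the conjugate pair {d_1, d_2} are illustrative choices (not from the text); stepping with the α_j of (51) lands on the minimum in exactly W = 2 steps:

```python
# E = w1^2 + 2*w2^2 has Hessian H = diag(2, 4) and gradient g(w) = H w (b = 0).
H = [[2.0, 0.0], [0.0, 4.0]]

def matvec(m, v):
    return [m[0][0]*v[0] + m[0][1]*v[1], m[1][0]*v[0] + m[1][1]*v[1]]

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1]

def grad(w):                      # g(w) = b + H w, with b = 0 here
    return matvec(H, w)

d1 = [1.0, 1.0]
d2 = [2.0, -1.0]                  # d2^T H d1 = 2*2 + 4*(-1) = 0: conjugate pair
assert abs(dot(d2, matvec(H, d1))) < 1e-12

w = [1.0, 2.0]                    # an arbitrary starting point
for d in (d1, d2):
    g = grad(w)
    alpha = -dot(d, g) / dot(d, matvec(H, d))   # eq. (51)
    w = [w[0] + alpha*d[0], w[1] + alpha*d[1]]  # eq. (50)

print(w)  # the minimum (0, 0), up to rounding, reached in W = 2 steps
```

Note that neither direction points along the negative gradient, yet the second step still lands exactly on the minimum because it never undoes the progress of the first.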
F. Constructing conjugate vectors

So far in our development, we have assumed a set of mutually H-orthogonal vectors {d_j}, but have not talked about how to construct these vectors. The theorem below gives one iterative method for doing just that.

Theorem: Let d_j be defined as follows:

1. Let,

    d_1 = -g_1.    (68)

2. Let,

    d_{j+1} = -g_{j+1} + β_j d_j, j ≥ 1,    (69)

where,

    β_j = g_{j+1}^T H d_j / (d_j^T H d_j).    (70)

The construction of (68), (69) and (70) generates W mutually H-orthogonal vectors on a W-dimensional quadratic error surface, such that,

    d_j^T H d_i = 0, i ≠ j.    (71)

Proof: First, we will show that our choice of β_j in (70) implies,

    d_{j+1}^T H d_j = 0.    (72)

In other words, we will show that consecutive search directions are H-orthogonal. Pre-multiplying (69) by d_j^T H,

    d_j^T H d_{j+1} = -d_j^T H g_{j+1} + β_j d_j^T H d_j,    (73)

and substituting (70) into (73), we get,

    d_j^T H d_{j+1} = -d_j^T H g_{j+1} + (g_{j+1}^T H d_j / (d_j^T H d_j)) d_j^T H d_j,    (74)
    d_j^T H d_{j+1} = -d_j^T H g_{j+1} + g_{j+1}^T H d_j = 0,    (75)

where the final equality follows from the symmetry of H.

Next, we demonstrate that if,

    d_j^T H d_i = 0 ⇒ d_{j+1}^T H d_i = 0, i < j,    (76)

(⇒ means "implies"), then,

    d_j^T H d_i = 0, ∀i ≠ j.    (77)

This can be shown by induction on (72) and (76); for example,

    d_2^T H d_1 = 0 (from (72), by construction),    (78)
    d_3^T H d_1 = 0 (from (78) and (76), by assumption),    (79)
    d_3^T H d_2 = 0 (from (72), by construction),    (80)
    d_4^T H d_1 = 0 (from (79) and (76), by assumption),    (81)
    d_4^T H d_2 = 0 (from (80) and (76), by assumption),    (82)
    d_4^T H d_3 = 0 (from (72), by construction), etc.    (83)

Thus, to complete the proof, we need to show that (76) is true. First, we transpose (69) and post-multiply by H d_i, i < j,

    d_{j+1}^T H d_i = -g_{j+1}^T H d_i + β_j d_j^T H d_i.    (84)

Now, by the antecedent of (76), d_j^T H d_i = 0. Hence,

    d_{j+1}^T H d_i = -g_{j+1}^T H d_i.    (85)

From (32), we have that,

    w_{j+1} - w_j = α_j d_j,    (86)

so that,

    H (w_{j+1} - w_j) = α_j H d_j.    (87)

Since,

    g_j = H w_j + b,    (88)

we also have,

    H (w_{j+1} - w_j) = g_{j+1} - g_j.    (89)

Combining equations (87) and (89),

    α_j H d_j = g_{j+1} - g_j,    (90)
    H d_j = (1/α_j)(g_{j+1} - g_j).    (91)

We substitute equation (91) into the right-hand side of equation (85) [switching the index from j to i in equation (91)], so that,

    d_{j+1}^T H d_i = -(1/α_i) g_{j+1}^T (g_{i+1} - g_i),    (92)
    d_{j+1}^T H d_i = -(1/α_i)(g_{j+1}^T g_{i+1} - g_{j+1}^T g_i),    (93)
    d_{j+1}^T H d_i = -(1/α_i)(g_{i+1}^T g_{j+1} - g_i^T g_{j+1}).    (94)

Now, we will show that the right-hand side of equation (94) is zero, by showing that,

    g_k^T g_j = 0, k < j.    (95)

This will complete the proof. From (68) and (69), we see that d_k, k ≥ 1, is a linear combination of all previous gradient directions,

    d_k = -g_k + Σ_{l=1}^{k-1} γ_l g_l,    (96)

where the γ_l are scalar coefficients. For example,

    d_1 = -g_1,    (97)
    d_2 = -g_2 + β_1 d_1 = -g_2 - β_1 g_1,    (98)
    d_3 = -g_3 + β_2 d_2 = -g_3 - β_2 g_2 - β_2 β_1 g_1.    (99)

We have already shown that,

    d_k^T g_j = 0, k < j,    (100)

so that if we transpose equation (96) and post-multiply by g_j, j > k,

    d_k^T g_j = -g_k^T g_j + Σ_{l=1}^{k-1} γ_l g_l^T g_j,    (101)
    0 = -g_k^T g_j + Σ_{l=1}^{k-1} γ_l g_l^T g_j [from (100)],    (102)
    g_k^T g_j = Σ_{l=1}^{k-1} γ_l g_l^T g_j.    (103)

Since d_1 = -g_1,

    d_1^T g_j = -g_1^T g_j = 0, j > 1 [from (100)],    (104)
    g_1^T g_j = 0, j > 1.    (105)

By induction on equations (103) and (105), we can now show that (95) is true; for example,

    g_1^T g_2 = 0 [from (105)],    (106)
    g_2^T g_3 = γ_1 g_1^T g_3 = 0 [from (103) and (105)],    (107)
    g_1^T g_4 = 0 [from (105)],    (108)
    g_2^T g_4 = γ_1 g_1^T g_4 = 0 [from (103) and (105)],    (109)
    g_3^T g_4 = γ_1 g_1^T g_4 + γ_2 g_2^T g_4 = 0 [from (103) and (109)], etc.    (110)

Thus,

    g_k^T g_j = 0, k < j,    (111)

so that,

    d_{j+1}^T H d_i = -(1/α_i)(g_{i+1}^T g_{j+1} - g_i^T g_{j+1}) = 0, i < j.    (112)

Recall that equation (112) presupposes d_j^T H d_i = 0, so that we have shown equation (76) to be true; i.e.,

    d_{j+1}^T H d_i = 0 when d_j^T H d_i = 0, i < j.    (113)

Thus, by induction on equations (72) and (76), the construction in (68), (69) and (70) generates mutually H-orthogonal vectors on a quadratic error surface, such that,

    d_j^T H d_i = 0, i ≠ j,    (114)

and the proof is complete.
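The construction (68)-(70) is easy to exercise numerically. In the sketch below, H and b define a made-up three-weight quadratic surface (H is symmetric positive definite, chosen purely for illustration); we generate three directions with the Hessian form of β_j and confirm both mutual H-orthogonality (71) and convergence in W = 3 steps:

```python
# Quadratic surface E(w) = b^T w + 0.5 w^T H w in W = 3 dimensions.
H = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]        # symmetric, positive definite (illustrative values)
b = [1.0, -1.0, 2.0]

def matvec(m, v): return [sum(m[i][j]*v[j] for j in range(3)) for i in range(3)]
def dot(a, c):    return sum(x*y for x, y in zip(a, c))
def grad(w):      return [bi + hv for bi, hv in zip(b, matvec(H, w))]

w = [0.0, 0.0, 0.0]
g = grad(w)
d = [-gi for gi in g]                        # d1 = -g1, eq. (68)
dirs = []
for _ in range(3):
    Hd = matvec(H, d)
    alpha = -dot(d, g) / dot(d, Hd)          # exact step size, eq. (48)
    w = [wi + alpha*di for wi, di in zip(w, d)]
    g = grad(w)
    beta = dot(g, Hd) / dot(d, Hd)           # Hessian form of beta_j, eq. (70)
    dirs.append(d)
    d = [-gi + beta*di for gi, di in zip(g, d)]  # eq. (69)

# Mutual H-orthogonality (eq. 71) and a zero gradient after W = 3 steps:
for i in range(3):
    for j in range(i + 1, 3):
        assert abs(dot(dirs[i], matvec(H, dirs[j]))) < 1e-9
assert all(abs(gi) < 1e-9 for gi in g)
```

Although only consecutive pairs are forced to be conjugate by (70), the assertions confirm that all three directions come out mutually H-orthogonal, exactly as the theorem claims.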
G. Computing β_j without the Hessian

We now have the following algorithm for training the weights of a neural network:

1. Let,

    d_1 = -g_1,    (115)

where w_1 is a random initial weight vector.

2. For j ≥ 1, let,

    w_{j+1} = w_j + α_j d_j,    (116)

where,

    α_j = -d_j^T g_j / (d_j^T H d_j).    (117)

3. Let,

    d_{j+1} = -g_{j+1} + β_j d_j,    (118)

where,

    β_j = g_{j+1}^T H d_j / (d_j^T H d_j).    (119)

4. Iterate steps 2 and 3.

Ideally, we would like to compute both α_j and β_j without explicit computation of H. Let us look at β_j first. Substituting equation (91),

    H d_j = (1/α_j)(g_{j+1} - g_j),    (120)

into equation (119), we get,

    β_j = g_{j+1}^T (g_{j+1} - g_j) / (d_j^T (g_{j+1} - g_j)),    (121)

which is known as the Hestenes-Stiefel expression for β_j. Expression (121) can be further simplified. From equation (118),

    d_j = -g_j + β_{j-1} d_{j-1}.    (122)

Then, transposing and post-multiplying equation (122) by g_j, we get,

    d_j^T g_j = -g_j^T g_j + β_{j-1} d_{j-1}^T g_j.    (123)

Since,

    d_k^T g_j = 0, k < j,    (124)

equation (123) reduces to,

    d_j^T g_j = -g_j^T g_j,    (125)
    g_j^T g_j = -d_j^T g_j.    (126)
Consequently, since d_j^T g_{j+1} = 0 by (60), the denominator of (121) equals g_j^T g_j, and equation (121) reduces to,

    β_j = g_{j+1}^T (g_{j+1} - g_j) / (g_j^T g_j),    (127)

which is known as the Polak-Ribiere expression for β_j. Finally, since,

    g_k^T g_j = 0, k < j,    (128)

equation (127) further reduces to,

    β_j = g_{j+1}^T g_{j+1} / (g_j^T g_j),    (129)

which is known as the Fletcher-Reeves expression for β_j. Note that none of the three β_j expressions (equations (121), (127) and (129), respectively) requires computation of the local Hessian H. Also, note that on a purely quadratic error surface, the three expressions above are identical. Since a neural network error function is typically only approximately quadratic, however, the different expressions for β_j will give slightly different results. According to [1], the Polak-Ribiere form in (127) usually gives slightly better results in practice than the other two expressions. With (127), when the algorithm is making poor progress (g_{j+1} ≈ g_j), β_j will be approximately zero, so that the search direction will be reset to the negative gradient, effectively re-initializing or restarting the conjugate gradient algorithm.

H. Computing α_j without the Hessian

As we will show below, rather than compute α_j (the step size along each search direction) through equation (51),

    α_j = -d_j^T g_j / (d_j^T H d_j),    (130)

we can instead determine α_j through a line minimization along d_j. We will show below that for a quadratic error surface, this line minimization is equivalent to computing α_j explicitly. For a quadratic error surface,

    E(w) = E_0 + b^T w + (1/2) w^T H w,    (131)

we have that,

    E(w_j + α_j d_j) = E_0 + b^T (w_j + α_j d_j) + (1/2)(w_j + α_j d_j)^T H (w_j + α_j d_j).    (132)

We now differentiate equation (132) with respect to α_j and equate to zero:

    ∂E(w_j + α_j d_j)/∂α_j = b^T d_j + (1/2) d_j^T H (w_j + α_j d_j) + (1/2)(w_j + α_j d_j)^T H d_j = 0.    (133)

Since H is symmetric,

    d_j^T H (w_j + α_j d_j) = (w_j + α_j d_j)^T H d_j,    (134)

equation (133) reduces to,

    b^T d_j + d_j^T H (w_j + α_j d_j) = 0,    (135)
    d_j^T b + d_j^T H w_j + α_j d_j^T H d_j = 0.    (136)
    α_j d_j^T H d_j = -d_j^T (b + H w_j).    (137)

Since,

    g_j = b + H w_j,    (138)

equation (137) reduces to,

    α_j d_j^T H d_j = -d_j^T g_j,    (139)
    α_j = -d_j^T g_j / (d_j^T H d_j).    (140)

Note that equation (140) is equivalent to equation (130), which was derived independently from theoretical considerations. Therefore, we have shown that we may safely replace explicit computation of α_j with a line minimization along d_j.

3. The conjugate gradient algorithm

We have now arrived at the complete conjugate gradient algorithm, which we summarize below:

1. Choose an initial weight vector w_1 (e.g., a vector of small, random weights).

2. Evaluate g_1 = ∇E[w_1] and let,

    d_1 = -g_1.    (141)

3. At step j, j ≥ 1, perform a line minimization along d_j such that,

    E(w_j + α* d_j) ≤ E(w_j + α d_j), ∀α.    (142)

4. Let,

    w_{j+1} = w_j + α* d_j.    (143)

5. Evaluate g_{j+1} = ∇E[w_{j+1}].

6. Let,

    d_{j+1} = -g_{j+1} + β_j d_j,    (144)

where,

    β_j = g_{j+1}^T (g_{j+1} - g_j) / (g_j^T g_j).    (145)

7. Let j = j + 1 and go to step 3.

Note that at no time does the above algorithm require explicit computation of the Hessian H. In fact, the conjugate gradient algorithm requires very little additional computation compared to the steepest descent algorithm. It does, however, exhibit significantly faster convergence properties. On a W-dimensional quadratic error surface, the conjugate gradient algorithm is guaranteed to converge to the minimum in at most W steps; no such guarantee exists for the steepest descent algorithm. The main difference between the two algorithms is that the conjugate gradient algorithm exploits a reasonable assumption about the error surface (i.e., locally quadratic) to achieve faster convergence.

A neural network error surface will in general, of course, not be globally quadratic. Therefore, in practice, more than W steps will usually be required before convergence to a local minimum. Consequently, in applying the conjugate gradient algorithm to neural network training, we typically reset the algorithm every W steps; that is,
we reset the search direction to the negative gradient [equation (141)]. Finally, we note that throughout our development, we have assumed that the Hessian H is positive definite. When H is not positive definite, the line minimization in step 3 of the algorithm prevents the algorithm from making negative progress (i.e., increasing the error of the neural network).

4. Examples

In this section, we will examine the conjugate gradient algorithm on some simple examples, including (1) a simple quadratic error surface, (2) a simple non-quadratic error surface, and (3) a simple neural network training example. We will see that the conjugate gradient algorithm significantly outperforms both the steepest descent and gradient descent algorithms in speed of convergence.

A. Quadratic example

Here we consider the simple quadratic error surface,

    E = ω_1² + 2ω_2².    (146)

For this error surface, and from the initial weights indicated in Figure 1, Figure 1 below plots the trajectory in weight space of (1) the conjugate gradient algorithm, (2) the steepest descent algorithm, and (3) the gradient descent algorithm with a learning rate of η = 0.4. Each trajectory is superimposed on top of a contour plot of E and is stopped when E < 10^-6. Note that the minimum for E occurs at (ω_1, ω_2) = (0, 0).

From Figure 1, we make several observations. First, as expected, the conjugate gradient algorithm does indeed converge in two steps for a two-dimensional quadratic error surface. By contrast, the steepest descent algorithm requires 5 steps to converge to an error E < 10^-6, even though the two algorithms require approximately the same amount of computation per step. The gradient descent algorithm comes in a distant third at 75 steps to convergence for a learning rate η = 0.4, and this does not count the amount of computation performed in searching for the near-optimal value of the learning rate η.
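For concreteness, the seven steps of Section 3 can be sketched in a few lines of code. The bracketing-plus-ternary-search line minimization below is an illustrative stand-in for whatever line-search routine one prefers; applied to the quadratic surface (146), the iteration reproduces the rapid convergence just described:

```python
# Hessian-free conjugate gradient: numerical line minimization along d_j
# plus the Polak-Ribiere beta of eq. (145). Illustrative sketch only.
def E(w):    return w[0]**2 + 2.0*w[1]**2     # the quadratic example surface
def grad(w): return [2.0*w[0], 4.0*w[1]]

def line_minimize(w, d):
    """Return alpha >= 0 approximately minimizing E(w + alpha d)."""
    phi = lambda a: E([w[0] + a*d[0], w[1] + a*d[1]])
    hi = 1.0
    while phi(2.0*hi) < phi(hi):              # bracket the minimizer
        hi *= 2.0
    lo = 0.0
    for _ in range(200):                      # ternary search on [lo, hi]
        m1, m2 = lo + (hi - lo)/3.0, hi - (hi - lo)/3.0
        if phi(m1) < phi(m2): hi = m2
        else:                 lo = m1
    return (lo + hi) / 2.0

w = [1.0, 1.0]                                # illustrative starting weights
g = grad(w)
d = [-g[0], -g[1]]                            # step 2: d1 = -g1
steps = 0
while E(w) > 1e-12 and steps < 20:
    alpha = line_minimize(w, d)               # step 3: line minimization
    w = [w[0] + alpha*d[0], w[1] + alpha*d[1]]
    g_new = grad(w)
    beta = (g_new[0]*(g_new[0]-g[0]) + g_new[1]*(g_new[1]-g[1])) \
           / (g[0]*g[0] + g[1]*g[1])          # step 6: Polak-Ribiere
    d = [-g_new[0] + beta*d[0], -g_new[1] + beta*d[1]]
    g = g_new
    steps += 1

print(steps, E(w))  # converges in a handful of steps (2 with an exact line search)
```

On a non-quadratic surface, one would additionally reset d to the negative gradient every W steps, as discussed above; no derivative of E beyond the gradient is ever needed.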
In fact, one of the nicest features of the conjugate gradient algorithm is that it requires no hand-selection of any learning parameter, momentum term, etc.

[Figure 1: weight-space trajectories on a contour plot of E over (ω_1, ω_2): conjugate gradient (2 steps to convergence), steepest descent (5 steps) and gradient descent with η = 0.4 (75 steps).]

A legitimate complaint about Figure 1 is that the conjugate gradient algorithm is specifically designed for a quadratic error surface, so that its superior performance in this case is not surprising; in other words, the skeptical reader might easily conclude that the comparison in Figure 1 is rigged. Therefore, we consider a decidedly non-quadratic error surface in the following section.

B. Non-quadratic example

Here we consider the simple non-quadratic error surface,

    E = 1 - exp(-5ω_1² - ω_2²),    (147)

which is plotted in Figure 2. This error surface is characteristic of some regions in weight space for neural network training problems in that for certain weights, the gradient is close to zero.

[Figure 2: surface plot of the non-quadratic error E as a function of (ω_1, ω_2).]

For this error surface, and from the initial weights indicated in Figure 3, Figure 3 below plots the trajectory in weight space of (1) the conjugate gradient algorithm, (2) the steepest descent algorithm, and (3) the gradient descent algorithm. Each trajectory is superimposed on top of a contour plot of E and is stopped when E < 10^-6. Note that the minimum for E occurs at (ω_1, ω_2) = (0, 0).

From Figure 3, we make several observations. First, the conjugate gradient algorithm no longer converges in just two steps (for a two-dimensional surface), since E in (147) is not quadratic; it does, however, converge in four short steps to E < 10^-6. By contrast, the steepest descent algorithm requires substantially more steps to converge to an error E < 10^-6, and the gradient descent algorithm comes in a distant third; once again, the count for gradient descent does not account for the computation performed in searching for the near-optimal value of the learning rate η.

The observant reader may still be skeptical. Why, you ask, did we choose that particular initial weight vector for this problem? Caught once again! It turns out that these particular initial weights fall just inside the region of the weight space where the Hessian H is positive definite. The region in Figure 4 encircled by the heavy line corresponds to that region in weight space where H > 0; note that the initial weights fall just inside that region.

So how does the conjugate gradient algorithm fare when H is not positive definite? Figure 5 answers this question. In Figure 5, we initialize the weights outside the positive-definite region, and once again track the conjugate gradient, steepest descent and gradient descent algorithms.
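The positive-definite region of Figure 4 can be probed numerically. Assuming the surface E = 1 - exp(-5ω_1² - ω_2²) as reconstructed above, the sketch below estimates the Hessian by central finite differences and tests its eigenvalues at two points (the probe points and the step size h are illustrative choices):

```python
import math

def E(w1, w2):
    # The non-quadratic example surface (as reconstructed in these notes).
    return 1.0 - math.exp(-5.0*w1*w1 - w2*w2)

def hessian(w1, w2, h=1e-4):
    """Central finite-difference estimate of the 2x2 Hessian of E."""
    d2_11 = (E(w1+h, w2) - 2*E(w1, w2) + E(w1-h, w2)) / h**2
    d2_22 = (E(w1, w2+h) - 2*E(w1, w2) + E(w1, w2-h)) / h**2
    d2_12 = (E(w1+h, w2+h) - E(w1+h, w2-h)
             - E(w1-h, w2+h) + E(w1-h, w2-h)) / (4*h**2)
    return d2_11, d2_12, d2_22

def is_positive_definite(a, b, d):
    # Smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, d]].
    tr, det = a + d, a*d - b*b
    disc = math.sqrt(max(tr*tr - 4.0*det, 0.0))
    return (tr - disc) / 2.0 > 0.0

print(is_positive_definite(*hessian(0.0, 0.0)))  # True: near the minimum, H > 0
print(is_positive_definite(*hessian(1.0, 1.0)))  # False: in the flat outer region
```

This matches the qualitative picture in Figure 4: conjugacy arguments only apply inside the region where H > 0, and outside it the algorithm must rely on the line minimization to keep making progress.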
Note that once again, the conjugate gradient algorithm clearly outperforms its two competitors.

[Figure 3: weight-space trajectories of the conjugate gradient (4 steps to convergence), steepest descent and gradient descent algorithms on the non-quadratic error surface.]

[Figure 4: the region of weight space (encircled by the heavy line) where the Hessian H of the non-quadratic error surface is positive definite.]

[Figure 5: weight-space trajectories of the three algorithms from initial weights outside the positive-definite region.]

In fact, the gradient descent algorithm degenerates to an awful 94 steps to convergence; this poor performance is typical of gradient descent in flat regions of the error surface.

C. Neural-network training example

Finally, we look at a simple neural network training example. Here we are interested in approximating,

    y = sin²(πx), x ∈ [0, 1], (see Figure 6)    (148)

with a single-layer, three-hidden-unit neural network (see Figure 6) and symmetric sigmoidal activation functions,

    γ(u) = 1/(1 + e^-u) - 1/2.    (149)

(Due to an oversight, no connection from the bias unit to the output unit was included in this network; thus, there are a total of nine trainable weights in this neural network.) Our training data for this problem consists of N evenly spaced (x, y) points sampled from x ∈ [0, 1]. Figure 7 plots,

    log10(E/N),    (150)
as a function of the number of steps (or epochs) of the conjugate gradient training algorithm. By comparison, Figure 8 illustrates that steepest descent does not come close to approaching the conjugate gradient algorithm's performance, even when run for many more steps. Finally, we note that gradient descent (with η = 0.3) requires thousands of epochs to reach approximately the same error as the conjugate gradient algorithm reaches in far fewer, as shown in Figure 9. Once again, the conjugate gradient algorithm converges in many fewer steps than either of its two competitors.

[Figure 6: the target function y = sin²(πx) on x ∈ [0, 1], and the single-layer, three-hidden-unit network used to approximate it.]

D. Postscript

Let us look at two more issues with regard to the neural network training example of the previous section. First, Figure 10 plots the neural network's progress in approximating y as the number of epochs (steps) increases. Note how the conjugate gradient algorithm slowly begins to bend an initially linear approximation to ultimately generate a very good fit of the sine curve.

Second, it is interesting to examine how this neural network, with sigmoidal activation functions centered about the origin [see equation (149)], is able to approximate the function y (with non-zero average value) without a bias connection to the output unit z. Let us denote z_i, i ∈ {1, 2, 3}, as the net contribution from hidden unit i to the output z, so that,

    z = z_1 + z_2 + z_3.    (151)

In other words,

    z_i = ω_i h_i, i ∈ {1, 2, 3},    (152)

where ω_i denotes the weight from the i-th hidden unit to the output, and h_i denotes the output of hidden unit i. Figure 11 (top) plots each of the z_i, which are easily recognized as different parts of the symmetric sigmoid function in equation (149). To understand how they combine to form an approximation of y, we separately plot z_1 + z_2 and z_3 in Figure 11 (bottom). Note that z_3 is primarily responsible for the overall shape of the z function, while z_1 + z_2 together pull the ends of z to conform to the familiar sine curve shape.

5. Conclusion

We have derived the conjugate gradient algorithm and illustrated its properties on some simple examples. We have shown that it outperforms both the gradient descent and the steepest descent algorithms in terms of the number of computations required to reach convergence. This improvement has been achieved primarily by accounting for the local properties of error surfaces, allowing us to employ gradient information in a more intelligent manner; furthermore, the conjugate gradient algorithm makes unnecessary the often painful trial-and-error selection of tunable learning parameters.

[1] C. M. Bishop, "Chapter 7: Parameter Optimization Algorithms," Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[Figure 7: log10(E/N) versus number of epochs for the conjugate gradient algorithm.]

[Figure 8: log10(E/N) versus number of epochs for the steepest descent algorithm, compared with the conjugate gradient algorithm.]

[Figure 9: log10(E/N) versus number of epochs for the gradient descent algorithm, compared with the conjugate gradient algorithm.]
[Figure 10: the network's approximation of y at increasing numbers of training epochs.]
[Figure 11: top: the individual hidden-unit contributions z_1, z_2, z_3 as functions of x; bottom: z_1 + z_2 and z_3, showing how they combine to approximate y.]
More informationLecture 10: September 26
0-725: Optimization Fall 202 Lecture 0: September 26 Lecturer: Barnabas Poczos/Ryan Tibshirani Scribes: Yipei Wang, Zhiguang Huo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationThe Steepest Descent Algorithm for Unconstrained Optimization
The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem
More informationEigen Vector Descent and Line Search for Multilayer Perceptron
igen Vector Descent and Line Search for Multilayer Perceptron Seiya Satoh and Ryohei Nakano Abstract As learning methods of a multilayer perceptron (MLP), we have the BP algorithm, Newton s method, quasi-
More informationCombining Conjugate Direction Methods with Stochastic Approximation of Gradients
Combining Conjugate Direction Methods with Stochastic Approximation of Gradients Nicol N Schraudolph Thore Graepel Institute of Computational Sciences Eidgenössische Technische Hochschule (ETH) CH-8092
More informationProgramming, numerics and optimization
Programming, numerics and optimization Lecture C-3: Unconstrained optimization II Łukasz Jankowski ljank@ippt.pan.pl Institute of Fundamental Technological Research Room 4.32, Phone +22.8261281 ext. 428
More informationLecture Notes: Geometric Considerations in Unconstrained Optimization
Lecture Notes: Geometric Considerations in Unconstrained Optimization James T. Allison February 15, 2006 The primary objectives of this lecture on unconstrained optimization are to: Establish connections
More informationComputational statistics
Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationNeural Networks Learning the network: Backprop , Fall 2018 Lecture 4
Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:
More informationSerious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions
BACK-PROPAGATION NETWORKS Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks Cannot approximate (learn) non-linear functions Difficult (if not impossible) to design
More informationData Mining (Mineria de Dades)
Data Mining (Mineria de Dades) Lluís A. Belanche belanche@lsi.upc.edu Soft Computing Research Group Dept. de Llenguatges i Sistemes Informàtics (Software department) Universitat Politècnica de Catalunya
More informationNew hybrid conjugate gradient methods with the generalized Wolfe line search
Xu and Kong SpringerPlus (016)5:881 DOI 10.1186/s40064-016-5-9 METHODOLOGY New hybrid conjugate gradient methods with the generalized Wolfe line search Open Access Xiao Xu * and Fan yu Kong *Correspondence:
More information1 What a Neural Network Computes
Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationGradient Descent. Dr. Xiaowei Huang
Gradient Descent Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Three machine learning algorithms: decision tree learning k-nn linear regression only optimization objectives are discussed,
More informationIterative Methods for Solving A x = b
Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http
More informationNotes on Some Methods for Solving Linear Systems
Notes on Some Methods for Solving Linear Systems Dianne P. O Leary, 1983 and 1999 and 2007 September 25, 2007 When the matrix A is symmetric and positive definite, we have a whole new class of algorithms
More informationNumerical Optimization of Partial Differential Equations
Numerical Optimization of Partial Differential Equations Part I: basic optimization concepts in R n Bartosz Protas Department of Mathematics & Statistics McMaster University, Hamilton, Ontario, Canada
More informationUnconstrained minimization of smooth functions
Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and
More informationx k+1 = x k + α k p k (13.1)
13 Gradient Descent Methods Lab Objective: Iterative optimization methods choose a search direction and a step size at each iteration One simple choice for the search direction is the negative gradient,
More informationLearning with Momentum, Conjugate Gradient Learning
Learning with Momentum, Conjugate Gradient Learning Introduction to Neural Networks : Lecture 8 John A. Bullinaria, 2004 1. Visualising Learning 2. Learning with Momentum 3. Learning with Line Searches
More informationToday. Introduction to optimization Definition and motivation 1-dimensional methods. Multi-dimensional methods. General strategies, value-only methods
Optimization Last time Root inding: deinition, motivation Algorithms: Bisection, alse position, secant, Newton-Raphson Convergence & tradeos Eample applications o Newton s method Root inding in > 1 dimension
More informationCSC 578 Neural Networks and Deep Learning
CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross
More informationy(x n, w) t n 2. (1)
Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising a set of input vector {x n }, n = 1,...N,
More informationThe Conjugate Gradient Method for Solving Linear Systems of Equations
The Conjugate Gradient Method for Solving Linear Systems of Equations Mike Rambo Mentor: Hans de Moor May 2016 Department of Mathematics, Saint Mary s College of California Contents 1 Introduction 2 2
More informationAn Iterative Descent Method
Conjugate Gradient: An Iterative Descent Method The Plan Review Iterative Descent Conjugate Gradient Review : Iterative Descent Iterative Descent is an unconstrained optimization process x (k+1) = x (k)
More informationNumerical Optimization
Numerical Optimization Unit 2: Multivariable optimization problems Che-Rung Lee Scribe: February 28, 2011 (UNIT 2) Numerical Optimization February 28, 2011 1 / 17 Partial derivative of a two variable function
More informationPattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore
Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal
More informationRepeated Eigenvalues and Symmetric Matrices
Repeated Eigenvalues and Symmetric Matrices. Introduction In this Section we further develop the theory of eigenvalues and eigenvectors in two distinct directions. Firstly we look at matrices where one
More informationConstrained optimization. Unconstrained optimization. One-dimensional. Multi-dimensional. Newton with equality constraints. Active-set method.
Optimization Unconstrained optimization One-dimensional Multi-dimensional Newton s method Basic Newton Gauss- Newton Quasi- Newton Descent methods Gradient descent Conjugate gradient Constrained optimization
More informationStep-size Estimation for Unconstrained Optimization Methods
Volume 24, N. 3, pp. 399 416, 2005 Copyright 2005 SBMAC ISSN 0101-8205 www.scielo.br/cam Step-size Estimation for Unconstrained Optimization Methods ZHEN-JUN SHI 1,2 and JIE SHEN 3 1 College of Operations
More informationLecture: Adaptive Filtering
ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate
More informationIntroduction to Neural Networks
Introduction to Neural Networks Steve Renals Automatic Speech Recognition ASR Lecture 10 24 February 2014 ASR Lecture 10 Introduction to Neural Networks 1 Neural networks for speech recognition Introduction
More information1 Numerical optimization
Contents 1 Numerical optimization 5 1.1 Optimization of single-variable functions............ 5 1.1.1 Golden Section Search................... 6 1.1. Fibonacci Search...................... 8 1. Algorithms
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationSolutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright.
Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright. John L. Weatherwax July 7, 2010 wax@alum.mit.edu 1 Chapter 5 (Conjugate Gradient Methods) Notes
More informationLecture 35 Minimization and maximization of functions. Powell s method in multidimensions Conjugate gradient method. Annealing methods.
Lecture 35 Minimization and maximization of functions Powell s method in multidimensions Conjugate gradient method. Annealing methods. We know how to minimize functions in one dimension. If we start at
More informationNonlinearOptimization
1/35 NonlinearOptimization Pavel Kordík Department of Computer Systems Faculty of Information Technology Czech Technical University in Prague Jiří Kašpar, Pavel Tvrdík, 2011 Unconstrained nonlinear optimization,
More informationSecond-order Learning Algorithm with Squared Penalty Term
Second-order Learning Algorithm with Squared Penalty Term Kazumi Saito Ryohei Nakano NTT Communication Science Laboratories 2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 69-2 Japan {saito,nakano}@cslab.kecl.ntt.jp
More informationNumerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09
Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods
More informationAI Programming CS F-20 Neural Networks
AI Programming CS662-2008F-20 Neural Networks David Galles Department of Computer Science University of San Francisco 20-0: Symbolic AI Most of this class has been focused on Symbolic AI Focus or symbols
More informationIntelligent Control. Module I- Neural Networks Lecture 7 Adaptive Learning Rate. Laxmidhar Behera
Intelligent Control Module I- Neural Networks Lecture 7 Adaptive Learning Rate Laxmidhar Behera Department of Electrical Engineering Indian Institute of Technology, Kanpur Recurrent Networks p.1/40 Subjects
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationConjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294)
Conjugate gradient method Descent method Hestenes, Stiefel 1952 For A N N SPD In exact arithmetic, solves in N steps In real arithmetic No guaranteed stopping Often converges in many fewer than N steps
More informationA recursive algorithm based on the extended Kalman filter for the training of feedforward neural models. Isabelle Rivals and Léon Personnaz
In Neurocomputing 2(-3): 279-294 (998). A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models Isabelle Rivals and Léon Personnaz Laboratoire d'électronique,
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationMath 411 Preliminaries
Math 411 Preliminaries Provide a list of preliminary vocabulary and concepts Preliminary Basic Netwon s method, Taylor series expansion (for single and multiple variables), Eigenvalue, Eigenvector, Vector
More informationA vector from the origin to H, V could be expressed using:
Linear Discriminant Function: the linear discriminant function: g(x) = w t x + ω 0 x is the point, w is the weight vector, and ω 0 is the bias (t is the transpose). Two Category Case: In the two category
More informationLecture 4: Perceptrons and Multilayer Perceptrons
Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons
More informationCSC321 Lecture 8: Optimization
CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:
More informationREVIEW OF DIFFERENTIAL CALCULUS
REVIEW OF DIFFERENTIAL CALCULUS DONU ARAPURA 1. Limits and continuity To simplify the statements, we will often stick to two variables, but everything holds with any number of variables. Let f(x, y) be
More informationNeural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Neural Networks CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Perceptrons x 0 = 1 x 1 x 2 z = h w T x Output: z x D A perceptron
More informationStable Dynamic Parameter Adaptation
Stable Dynamic Parameter Adaptation Stefan M. Riiger Fachbereich Informatik, Technische Universitat Berlin Sekr. FR 5-9, Franklinstr. 28/29 10587 Berlin, Germany async~cs. tu-berlin.de Abstract A stability
More informationNeural networks III: The delta learning rule with semilinear activation function
Neural networks III: The delta learning rule with semilinear activation function The standard delta rule essentially implements gradient descent in sum-squared error for linear activation functions. We
More informationECE580 Fall 2015 Solution to Midterm Exam 1 October 23, Please leave fractions as fractions, but simplify them, etc.
ECE580 Fall 2015 Solution to Midterm Exam 1 October 23, 2015 1 Name: Solution Score: /100 This exam is closed-book. You must show ALL of your work for full credit. Please read the questions carefully.
More informationOptimization Methods
Optimization Methods Decision making Examples: determining which ingredients and in what quantities to add to a mixture being made so that it will meet specifications on its composition allocating available
More informationTopics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems
Topics The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems What about non-spd systems? Methods requiring small history Methods requiring large history Summary of solvers 1 / 52 Conjugate
More informationPenalty and Barrier Methods. So we again build on our unconstrained algorithms, but in a different way.
AMSC 607 / CMSC 878o Advanced Numerical Optimization Fall 2008 UNIT 3: Constrained Optimization PART 3: Penalty and Barrier Methods Dianne P. O Leary c 2008 Reference: N&S Chapter 16 Penalty and Barrier
More informationClassification with Perceptrons. Reading:
Classification with Perceptrons Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks covers the basics of perceptrons and multilayer neural networks We will cover material in Chapters
More informationOptimization Tutorial 1. Basic Gradient Descent
E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More informationIntroduction to gradient descent
6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our
More informationIntroduction to feedforward neural networks
. Problem statement and historical context A. Learning framework Figure below illustrates the basic framework that we will see in artificial neural network learning. We assume that we want to learn a classification
More informationImproving the Convergence of Back-Propogation Learning with Second Order Methods
the of Back-Propogation Learning with Second Order Methods Sue Becker and Yann le Cun, Sept 1988 Kasey Bray, October 2017 Table of Contents 1 with Back-Propagation 2 the of BP 3 A Computationally Feasible
More informationNumerical Computation for Deep Learning
Numerical Computation for Deep Learning Lecture slides for Chapter 4 of Deep Learning www.deeplearningbook.org Ian Goodfellow Last modified 2017-10-14 Thanks to Justin Gilmer and Jacob Buckman for helpful
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationQM and Angular Momentum
Chapter 5 QM and Angular Momentum 5. Angular Momentum Operators In your Introductory Quantum Mechanics (QM) course you learned about the basic properties of low spin systems. Here we want to review that
More informationweightchanges alleviates many ofthe above problems. Momentum is an example of an improvement on our simple rst order method that keeps it rst order bu
Levenberg-Marquardt Optimization Sam Roweis Abstract Levenberg-Marquardt Optimization is a virtual standard in nonlinear optimization which signicantly outperforms gradient descent and conjugate gradient
More informationArtificial Neural Networks
Artificial Neural Networks Threshold units Gradient descent Multilayer networks Backpropagation Hidden layer representations Example: Face Recognition Advanced topics 1 Connectionist Models Consider humans:
More informationB553 Lecture 5: Matrix Algebra Review
B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations
More informationOptimization for neural networks
0 - : Optimization for neural networks Prof. J.C. Kao, UCLA Optimization for neural networks We previously introduced the principle of gradient descent. Now we will discuss specific modifications we make
More informationGradient Descent Training Rule: The Details
Gradient Descent Training Rule: The Details 1 For Perceptrons The whole idea behind gradient descent is to gradually, but consistently, decrease the output error by adjusting the weights. The trick is
More informationRegularization and Optimization of Backpropagation
Regularization and Optimization of Backpropagation The Norwegian University of Science and Technology (NTNU) Trondheim, Norway keithd@idi.ntnu.no October 17, 2017 Regularization Definition of Regularization
More informationArtifical Neural Networks
Neural Networks Artifical Neural Networks Neural Networks Biological Neural Networks.................................. Artificial Neural Networks................................... 3 ANN Structure...........................................
More informationMachine Learning. Neural Networks. (slides from Domingos, Pardo, others)
Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward
More informationDeep Learning II: Momentum & Adaptive Step Size
Deep Learning II: Momentum & Adaptive Step Size CS 760: Machine Learning Spring 2018 Mark Craven and David Page www.biostat.wisc.edu/~craven/cs760 1 Goals for the Lecture You should understand the following
More informationSupervised Learning in Neural Networks
The Norwegian University of Science and Technology (NTNU Trondheim, Norway keithd@idi.ntnu.no March 7, 2011 Supervised Learning Constant feedback from an instructor, indicating not only right/wrong, but
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationClassic K -means clustering. Classic K -means example (K = 2) Finding the optimal w k. Finding the optimal s n J =
Review of classic (GOF K -means clustering x 2 Fall 2015 x 1 Lecture 8, February 24, 2015 K-means is traditionally a clustering algorithm. Learning: Fit K prototypes w k (the rows of some matrix, W to
More information