Conjugate gradient training algorithm
Conjugate gradient training algorithm

So far: heuristic improvements to gradient descent (momentum, ...) and the steepest descent training algorithm. Can we do better?

Definitions:
- w(j): weight vector at step j
- g(j) = ∇E[w(j)]: gradient at step j
- d(j): search direction at step j

Next: the conjugate gradient training algorithm -- overview, derivation, examples.

Steepest descent algorithm
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Perform a line minimization along d(j): find η_j such that E(w(j) + η_j d(j)) ≤ E(w(j) + η d(j)) for all η.
3. Let w(j+1) = w(j) + η_j d(j).
4. Evaluate g(j+1).
5. Let d(j+1) = -g(j+1).
6. Let j = j + 1 and go to step 2.

[Figure: error surface E(ω1, ω2) used in the previous examples.]
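To make the loop above concrete, here is a minimal sketch in Python/NumPy. The error function, its gradient, the quadratic test surface, and the use of SciPy's scalar minimizer for the line search are illustrative assumptions, not part of the original slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(E, grad_E, w0, max_steps=100, tol=1e-8):
    """Steepest descent with an (approximate) line minimization at each step."""
    w = np.asarray(w0, dtype=float)
    for j in range(max_steps):
        g = grad_E(w)
        if np.linalg.norm(g) < tol:          # converged
            break
        d = -g                               # step 5: search direction = negative gradient
        # step 2: line minimization along d (numerical, not exact)
        eta = minimize_scalar(lambda a: E(w + a * d)).x
        w = w + eta * d                      # step 3: update weights
    return w, j

# Example: a simple quadratic error surface (an illustrative choice of H and b)
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 2.0])
E = lambda w: b @ w + 0.5 * w @ H @ w
grad_E = lambda w: b + H @ w

w_min, steps = steepest_descent(E, grad_E, w0=[4.0, -3.0])
print(w_min, steps)   # approaches the minimizer of the quadratic, -H^{-1} b
```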
Remember previous examples

Steepest descent algorithm: examples

[Figure: steepest descent trajectories on two error surfaces -- a quadratic surface and a non-quadratic surface E(ω1, ω2) containing an exponential term -- with the number of line-minimization steps needed to converge.]

Conjugate gradient algorithm (a sneak peek)
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Perform a line minimization along d(j): find η_j such that E(w(j) + η_j d(j)) ≤ E(w(j) + η d(j)) for all η.
3. Let w(j+1) = w(j) + η_j d(j).
4. Evaluate g(j+1).
5. Let d(j+1) = -g(j+1) + β_j d(j), where β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j)).
6. Let j = j + 1 and go to step 2.

[Figure: conjugate gradient trajectories on the same quadratic and non-quadratic surfaces, converging in far fewer steps than steepest descent.]
SD vs CG

Key difference: the new search direction. Very little additional computation (over steepest descent), and no more oscillation back and forth. Key question: why/how does this improve things? Answer: knowledge of the local quadratic properties of the error surface.

Conjugate gradients: a first look

In steepest descent, the line minimization guarantees only that
    g[w(t+1)]^T d(t) = 0.
(What does this mean? The new gradient is orthogonal to the previous search direction -- but only at the new point.)

Non-interfering directions: we would like
    g[w(t+1) + η d(t+1)]^T d(t) = 0  for all η.
(What the #$@!# does this mean? Moving along the new direction d(t+1) should never undo the minimization already performed along d(t).)

How do we achieve non-interfering directions?

FACT: g[w(t+1) + η d(t+1)]^T d(t) = 0 for all η implies
    d(t+1)^T H d(t) = 0   (H-orthogonality, conjugacy).

Hmmm... we need to pay attention to second-order properties of the error surface:
    E(w) ≈ E_0 + b^T w + (1/2) w^T H w.
Show H-orthogonality requirement

Approximate g(w) about w(t+1) by a first-order Taylor approximation:
    g(w) ≈ g[w(t+1)] + H [w - w(t+1)].
Evaluate g(w) at w = w(t+1) + η d(t+1):
    g[w(t+1) + η d(t+1)] = g[w(t+1)] + η H d(t+1).

Post-multiply (transpose and dot) by d(t):
- Left-hand side: g[w(t+1) + η d(t+1)]^T d(t) = 0 (assumption: non-interfering directions).
- Right-hand side: g[w(t+1)]^T d(t) + η d(t+1)^T H d(t) = η d(t+1)^T H d(t), since g[w(t+1)]^T d(t) = 0 from the line minimization.
Therefore η d(t+1)^T H d(t) = 0 for all η, i.e. d(t+1)^T H d(t) = 0.

How do we achieve non-interfering directions? Proven fact: g[w(t+1) + η d(t+1)]^T d(t) = 0 for all η implies d(t+1)^T H d(t) = 0 (H-orthogonality).

Derivation of conjugate gradient algorithm

Local quadratic assumption:
    E(w) = E_0 + b^T w + (1/2) w^T H w.
Assume W mutually conjugate vectors d(1), ..., d(W). Key: we need to construct consecutive search directions d(j) that are conjugate (H-orthogonal)! (Side note: what is the implicit assumption of steepest descent?)

Let w(1) be the initial weight vector and w* the minimum. Question: how do we converge to w* in (at most) W steps?
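The implication is easy to check numerically on an exactly quadratic surface, where the Taylor expansion of the gradient is exact. The following sketch is my own illustration (arbitrary H and b): it verifies that after an exact line minimization g[w(t+1)]^T d(t) = 0 at the new point, and that a direction made H-orthogonal to d(t) is "non-interfering" for every η.

```python
import numpy as np

# An exactly quadratic error surface (illustrative choice of H, b)
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 2.0])
grad = lambda w: b + H @ w

# One exact steepest-descent step from an arbitrary starting point
w0 = np.array([4.0, -3.0])
d0 = -grad(w0)
eta0 = -(grad(w0) @ d0) / (d0 @ H @ d0)   # exact line minimum along d0 on a quadratic
w1 = w0 + eta0 * d0

print(grad(w1) @ d0)          # ~0: new gradient orthogonal to old direction (at w1 only)

# A direction made H-orthogonal to d0 (remove the H-projection onto d0)
v = -grad(w1)
d1 = v - (v @ H @ d0) / (d0 @ H @ d0) * d0

for eta in (0.0, 0.5, 2.0):
    # Non-interference: moving along d1 never reintroduces a gradient component along d0
    print(eta, grad(w1 + eta * d1) @ d0)   # ~0 for every eta, because d1^T H d0 = 0
```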
Step-wise optimization

Write the displacement from the initial weight vector to the minimum as a sum over the conjugate directions:
    w* - w(1) = sum_{i=1}^{W} α_i d(i).
(Why can I do this? Linear independence of conjugate directions -- see below.)

Linear independence of conjugate directions

Theorem: For a positive-definite square matrix H, H-orthogonal vectors {d(1), d(2), ..., d(k)} are linearly independent.

Proof. Linear independence means
    α_1 d(1) + α_2 d(2) + ... + α_k d(k) = 0   iff   α_i = 0 for all i.
Pre-multiply the sum by d(i)^T H:
    α_1 d(i)^T H d(1) + α_2 d(i)^T H d(2) + ... + α_k d(i)^T H d(k) = 0.
By assumption d(i)^T H d(j) = 0 for i ≠ j, so this reduces to
    α_i d(i)^T H d(i) = 0.
Since H is positive definite, d(i)^T H d(i) > 0, therefore α_i = 0 for every i in {1, 2, ..., k}.
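As an illustration (not from the slides), a full set of H-orthogonal directions can be built from any basis by a Gram-Schmidt procedure in the H inner product, and their linear independence checked numerically. The function name and test matrix are my own choices.

```python
import numpy as np

def h_orthogonalize(H, vectors):
    """Gram-Schmidt in the H inner product: returns mutually H-orthogonal directions."""
    ds = []
    for v in vectors:
        d = v.astype(float).copy()
        for dj in ds:
            d -= (d @ H @ dj) / (dj @ H @ dj) * dj   # remove H-projection onto earlier directions
        ds.append(d)
    return ds

H = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 0.5],
              [0.0, 0.5, 1.0]])          # positive definite (illustrative)
ds = h_orthogonalize(H, list(np.eye(3)))

# Mutual conjugacy: d(i)^T H d(j) = 0 for i != j
print(np.round([[di @ H @ dj for dj in ds] for di in ds], 10))
# Linear independence: the directions span the whole W-dimensional space
print(np.linalg.matrix_rank(np.column_stack(ds)))   # -> 3
```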
Linear independence of conjugate directions

From linear independence: W H-orthogonal vectors form a complete basis set for the W-dimensional weight space. Any vector v can be expressed as v = sum_{i=1}^{W} α_i d(i). So, why did we need this result?

Step-wise optimization

    w* - w(1) = sum_{i=1}^{W} α_i d(i),   i.e.   w* = w(1) + sum_{i=1}^{W} α_i d(i).
Define the partial sums
    w(j+1) = w(1) + sum_{i=1}^{j} α_i d(i),   so that   w(j+1) = w(j) + α_j d(j).   (Ah-ha!)

So where are we now? On a locally quadratic surface we can converge to the minimum in, at most, W steps using
    w(j+1) = w(j) + α_j d(j),   j in {1, 2, ..., W}.
Big questions:
- How do we choose the step size α_j?
- How do we construct conjugate directions?
- How can we do everything without computing H?

Computing the correct step size

Given a set of W conjugate vectors:
    w* - w(1) = sum_{i=1}^{W} α_i d(i).
Pre-multiply by d(j)^T H:
    d(j)^T H (w* - w(1)) = sum_{i=1}^{W} α_i d(j)^T H d(i).
Computing the correct step size

By H-orthogonality (conjugacy), all terms with i ≠ j vanish (why?):
    d(j)^T H (w* - w(1)) = α_j d(j)^T H d(j).
Also, from E(w) = E_0 + b^T w + (1/2) w^T H w we have g(w) = b + H w, and at the minimum g(w*) = b + H w* = 0, so H w* = -b. Therefore
    d(j)^T H (w* - w(1)) = d(j)^T (H w* - H w(1)) = -d(j)^T (b + H w(1)) = α_j d(j)^T H d(j),
which gives
    α_j = -d(j)^T (b + H w(1)) / (d(j)^T H d(j)).
(What's the problem? The step size appears to depend on the initial point w(1).)

Now use w(j) = w(1) + sum_{i=1}^{j-1} α_i d(i) and pre-multiply by d(j)^T H:
    d(j)^T H w(j) = d(j)^T H w(1) + sum_{i=1}^{j-1} α_i d(j)^T H d(i) = d(j)^T H w(1)   (by conjugacy),
so w(1) can be replaced by w(j):
    α_j = -d(j)^T (b + H w(j)) / (d(j)^T H d(j)).
Computing the correct step size

Since g(j) = b + H w(j), the step size becomes
    α_j = -d(j)^T g(j) / (d(j)^T H d(j)).   (Woo-hoo!)

Important consequence

Theorem: Assuming a W-dimensional quadratic error surface,
    E(w) = E_0 + b^T w + (1/2) w^T H w,
and H-orthogonal vectors d(i), i in {1, 2, ..., W}, the update
    w(j+1) = w(j) + α_j d(j),   α_j = -d(j)^T g(j) / (d(j)^T H d(j)),
converges in at most W steps to the minimum w*. (For what error surface? Why is this so?)

Orthogonality of gradient to previous search directions

FACT: g(j+1)^T d(k) = (b + H w(j+1))^T d(k) = 0 for all k ≤ j.
How is this important? How is this different from steepest descent, which only guarantees orthogonality to the most recent direction?

Let's show that this is true. On the quadratic surface,
    g(j+1) - g(j) = H (w(j+1) - w(j)) = α_j H d(j).
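A small numerical sketch (my own, reusing the Gram-Schmidt idea from above) confirms the theorem: stepping along W conjugate directions with α_j = -d(j)^T g(j) / (d(j)^T H d(j)) lands exactly on the minimizer of the quadratic.

```python
import numpy as np

H = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 0.5],
              [0.0, 0.5, 1.0]])           # positive definite (illustrative)
b = np.array([1.0, -2.0, 0.5])
grad = lambda w: b + H @ w
w_star = -np.linalg.solve(H, b)           # true minimum of the quadratic

# Any set of mutually H-orthogonal directions (Gram-Schmidt on the identity basis)
ds = []
for v in np.eye(3):
    d = v.copy()
    for dj in ds:
        d -= (d @ H @ dj) / (dj @ H @ dj) * dj
    ds.append(d)

w = np.array([5.0, -4.0, 3.0])            # arbitrary starting point
for d in ds:                               # exactly W = 3 steps
    alpha = -(d @ grad(w)) / (d @ H @ d)
    w = w + alpha * d

print(np.allclose(w, w_star))              # True: converged in W steps
```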
Orthogonality of gradient to previous search directions

From g(j+1) = g(j) + α_j H d(j), pre-multiply by d(j)^T:
    d(j)^T g(j+1) = d(j)^T g(j) + α_j d(j)^T H d(j)
                  = d(j)^T g(j) - (d(j)^T g(j) / (d(j)^T H d(j))) d(j)^T H d(j) = 0.
So the new gradient is orthogonal to the most recent search direction. (We still need to show it for all k < j.)

For k < j, pre-multiply instead by d(k)^T:
    d(k)^T g(j+1) = d(k)^T g(j) + α_j d(k)^T H d(j) = d(k)^T g(j)   (why? conjugacy kills the second term).
By induction,
    d(k)^T g(j+1) = d(k)^T g(j) = ... = d(k)^T g(k+1) = 0   for all k ≤ j.
For example: d(j)^T g(j+1) = 0, and d(j-1)^T g(j+1) = d(j-1)^T g(j) = 0, and so on.
So where are we now?

On a locally quadratic surface we can converge to the minimum in, at most, W steps using
    w(j+1) = w(j) + α_j d(j),   α_j = -d(j)^T g(j) / (d(j)^T H d(j)),   j in {1, 2, ..., W}.
Remaining big questions:
- How do we construct conjugate directions?
- How can we do everything without computing H?

Theorem: Let d(1) = -g(1), and let the directions be defined recursively by
    d(j+1) = -g(j+1) + β_j d(j),   where   β_j = g(j+1)^T H d(j) / (d(j)^T H d(j)).
This construction generates W mutually H-orthogonal vectors, i.e. d(i)^T H d(j) = 0 for i ≠ j.

Proof, Part I

First goal: show that d(j+1)^T H d(j) = 0. Begin with d(j+1) = -g(j+1) + β_j d(j) and take the inner product with H d(j):
    d(j+1)^T H d(j) = -g(j+1)^T H d(j) + β_j d(j)^T H d(j) = -g(j+1)^T H d(j) + g(j+1)^T H d(j) = 0,
by the definition of β_j.
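Here is an illustrative sketch (my own) of this construction on an exactly quadratic surface: directions are generated with β_j = g(j+1)^T H d(j) / (d(j)^T H d(j)), the resulting set is mutually H-orthogonal, and the iterate converges in W steps.

```python
import numpy as np

H = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 0.5],
              [0.0, 0.5, 1.0]])          # positive definite (illustrative)
b = np.array([1.0, -2.0, 0.5])
grad = lambda w: b + H @ w

w = np.array([5.0, -4.0, 3.0])
d = -grad(w)                             # d(1) = -g(1)
directions = [d]

for j in range(3):                       # W = 3 steps on a 3-dimensional quadratic
    g = grad(w)
    alpha = -(d @ g) / (d @ H @ d)       # exact step size
    w = w + alpha * d
    g_new = grad(w)
    beta = (g_new @ H @ d) / (d @ H @ d) # the construction that guarantees conjugacy
    d = -g_new + beta * d
    directions.append(d)

print(np.allclose(w, -np.linalg.solve(H, b)))              # True: minimum reached
print(np.round([[di @ H @ dj for dj in directions[:3]]     # off-diagonal entries ~0
                for di in directions[:3]], 8))
```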
Proof, Part II

Second goal: show that d(j+1)^T H d(i) = 0 for i < j as well.

Remember that on the quadratic surface the gradients at successive steps satisfy
    g(i+1) - g(i) = H (w(i+1) - w(i)) = α_i H d(i),   i.e.   H d(i) = (1/α_i) (g(i+1) - g(i)).
Therefore
    g(j+1)^T H d(i) = (1/α_i) (g(j+1)^T g(i+1) - g(j+1)^T g(i)),
so the problem reduces to showing that the gradients are mutually orthogonal: g(j+1)^T g(k) = 0 for k ≤ j.

From the construction, d(1) = -g(1) and d(k) = -g(k) + β_{k-1} d(k-1), so each direction is a linear combination of the current and all previous gradients (why? unroll the recursion):
    d(1) = -g(1)
    d(2) = -g(2) + β_1 d(1) = -g(2) - β_1 g(1)
    d(3) = -g(3) + β_2 d(2) = -g(3) - β_2 g(2) - β_2 β_1 g(1)
    ...
    d(k) = -g(k) + sum_{l<k} γ_l g(l)   for some coefficients γ_l.

Equivalently, rearranging the construction, g(k) = -d(k) + β_{k-1} d(k-1), so each gradient is a linear combination of search directions. Post-multiply by g(j+1) and use the orthogonality of the gradient to all previous search directions, d(k)^T g(j+1) = 0 for k ≤ j:
    g(k)^T g(j+1) = -d(k)^T g(j+1) + β_{k-1} d(k-1)^T g(j+1) = 0   for all k ≤ j.
(In particular g(1)^T g(2) = 0, g(1)^T g(3) = 0, g(2)^T g(3) = 0, and so on.)

Thus g(j+1)^T H d(i) = 0 for i < j, and for the new search direction
    d(j+1)^T H d(i) = (-g(j+1) + β_j d(j))^T H d(i) = -g(j+1)^T H d(i) + β_j d(j)^T H d(i) = 0,
where the second term vanishes by the induction hypothesis.

Where we are:
    g(j+1)^T g(k) = 0 for k ≤ j,
    d(j+1)^T H d(j) = 0   (Part I),
    d(j+1)^T H d(i) = 0 for i < j   (Part II).
Does this show what we want, d(i)^T H d(j) = 0 for all i ≠ j? Kinda -- the two parts still have to be assembled into an induction.
The home stretch

We have d(j+1)^T H d(j) = 0 (Part I) and d(j+1)^T H d(i) = 0 for i < j (Part II). By induction:
    d(2)^T H d(1) = 0   (Part I)
    d(3)^T H d(1) = 0   (Part II),   d(3)^T H d(2) = 0   (Part I)
    d(4)^T H d(1) = 0   (Part II),   d(4)^T H d(2) = 0   (Part II),   d(4)^T H d(3) = 0   (Part I)
    etc., etc., etc.
So the construction generates W mutually H-orthogonal directions.

So where are we now?
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Update the weight vector: w(j+1) = w(j) + α_j d(j), with α_j = -d(j)^T g(j) / (d(j)^T H d(j)), j in {1, 2, ..., W}.
3. Evaluate g(j+1).
4. Let d(j+1) = -g(j+1) + β_j d(j), where β_j = g(j+1)^T H d(j) / (d(j)^T H d(j)).
5. Let j = j + 1 and go to step 2.
What's the problem? Every step still requires the Hessian H.

Computing without H

Remaining big question: how can we do everything without computing H? Two areas: the direction coefficient β_j and the step size α_j.

From earlier, β_j = g(j+1)^T H d(j) / (d(j)^T H d(j)). On the quadratic surface, α_j H d(j) = g(j+1) - g(j), so
    β_j = g(j+1)^T (g(j+1) - g(j)) / (d(j)^T (g(j+1) - g(j)))   (Hestenes-Stiefel).
Computing without H

Since d(k)^T g(j+1) = 0 for all k ≤ j, the denominator simplifies:
    d(j)^T (g(j+1) - g(j)) = -d(j)^T g(j) = (g(j) - β_{j-1} d(j-1))^T g(j) = g(j)^T g(j),
so
    β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))   (Polak-Ribière).
Using the mutual orthogonality of the gradients, g(j+1)^T g(k) = 0 for k ≤ j, the numerator simplifies as well:
    β_j = g(j+1)^T g(j+1) / (g(j)^T g(j))   (Fletcher-Reeves).

Three choices:
    β_j = g(j+1)^T (g(j+1) - g(j)) / (d(j)^T (g(j+1) - g(j)))   (Hestenes-Stiefel)
    β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))              (Polak-Ribière)
    β_j = g(j+1)^T g(j+1) / (g(j)^T g(j))                       (Fletcher-Reeves)
Which is best? On an exactly quadratic surface with exact line minimization all three coincide; they differ on real, non-quadratic error surfaces.
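The three coefficients are easy to state in code. A small sketch (illustrative; the function names are my own, and the unused argument is kept only to give the three a uniform signature):

```python
import numpy as np

def beta_hestenes_stiefel(g_new, g_old, d_old):
    return g_new @ (g_new - g_old) / (d_old @ (g_new - g_old))

def beta_polak_ribiere(g_new, g_old, d_old=None):
    return g_new @ (g_new - g_old) / (g_old @ g_old)

def beta_fletcher_reeves(g_new, g_old, d_old=None):
    return (g_new @ g_new) / (g_old @ g_old)

# On an exactly quadratic surface with exact line minimization the three agree;
# the complete algorithm below uses the Polak-Ribiere form, as in the slides.
```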
Computing without H

Key: replace the step-size formula α_j = -d(j)^T g(j) / (d(j)^T H d(j)) with a line minimization. On the quadratic surface,
    E(w(j) + α d(j)) = E_0 + b^T (w(j) + α d(j)) + (1/2) (w(j) + α d(j))^T H (w(j) + α d(j)).
Differentiate with respect to α and set to zero (using the symmetry of H):
    d/dα E(w(j) + α d(j)) = b^T d(j) + (w(j) + α d(j))^T H d(j) = (b + H w(j))^T d(j) + α d(j)^T H d(j) = 0,
so
    α = -(b + H w(j))^T d(j) / (d(j)^T H d(j)) = -d(j)^T g(j) / (d(j)^T H d(j)).
Conclusion: a line minimization along d(j) computes exactly the step size we derived, with no Hessian computation.
Complete conjugate gradient algorithm
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Perform a line minimization along d(j): find α_j such that E(w(j) + α_j d(j)) ≤ E(w(j) + α d(j)) for all α.
3. Let w(j+1) = w(j) + α_j d(j).
4. Evaluate g(j+1).
5. Let d(j+1) = -g(j+1) + β_j d(j), where β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))   (Polak-Ribière).
6. Let j = j + 1 and go to step 2.

Comments:
- Exploitation of a reasonable assumption about the local quadratic nature of the error surface.
- Little additional computation beyond steepest descent.
- No Hessian computation required.
- No hand-tuning of the learning rate.
- In practice, the conjugate gradient algorithm must be reset every W steps. (Why? What about violations of the H > 0 assumption?)

Quadratic example / Non-quadratic example

[Figure: steepest descent versus conjugate gradient on the quadratic surface E(ω1, ω2) and on the non-quadratic surface with an exponential term, showing the trajectories and the number of steps to convergence.]
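Below is a minimal sketch (Python/NumPy, my own illustration) of the complete algorithm above: Polak-Ribière conjugate gradient with a numerical line minimization and a reset to the steepest-descent direction every W steps, applied to a generic differentiable error function. The test function is an arbitrary non-quadratic choice, not the one from the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(E, grad_E, w0, max_steps=1000, tol=1e-8):
    """Polak-Ribiere conjugate gradient with line minimization and periodic resets."""
    w = np.asarray(w0, dtype=float)
    W = w.size                               # reset interval = dimensionality
    g = grad_E(w)
    d = -g                                   # step 1: d(1) = -g(1)
    for j in range(max_steps):
        if np.linalg.norm(g) < tol:
            break
        # step 2: line minimization along d (numerical, not exact)
        alpha = minimize_scalar(lambda a: E(w + a * d)).x
        w = w + alpha * d                    # step 3
        g_new = grad_E(w)                    # step 4
        if (j + 1) % W == 0:
            d = -g_new                       # periodic reset to steepest descent
        else:
            beta = g_new @ (g_new - g) / (g @ g)   # step 5: Polak-Ribiere
            d = -g_new + beta * d
        g = g_new                            # step 6: next iteration
    return w, j

# Example use on a non-quadratic error surface (illustrative choice)
E = lambda w: (1 - w[0])**2 + 5 * (w[1] - w[0]**2)**2
grad_E = lambda w: np.array([-2 * (1 - w[0]) - 20 * w[0] * (w[1] - w[0]**2),
                             10 * (w[1] - w[0]**2)])
print(conjugate_gradient(E, grad_E, w0=[-1.5, 2.0]))
```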
Non-quadratic example

[Figure: steepest descent and conjugate gradient trajectories on the non-quadratic surface from a different starting point (ω1, ω2). Why these initial weights? The plot marks the region where H > 0; outside it (H < 0) the local quadratic assumption is violated.]

Simple NN training example

Target function: y = sin(πx). A small feed-forward network with hidden units z is trained to approximate y from the input x.

[Figure: steepest descent and conjugate gradient weight trajectories for the simple network; plain gradient descent needs many more steps.]
Simple NN training example: error convergence

[Figures: log(E/N) versus number of epochs for the gradient descent, steepest descent, and conjugate gradient algorithms; the conjugate gradient curve drops far faster than the others.]

A closer look at convergence

[Figure: network output after increasing numbers of training epochs, showing the approximation to y = sin(πx) improving as training proceeds.]
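For concreteness, here is a self-contained sketch (my own illustration, not the slides' exact setup) of the same kind of experiment: a tiny tanh network with three hidden units fit to y = sin(πx). It uses SciPy's built-in nonlinear conjugate gradient (method='CG', a Polak-Ribière variant) rather than the hand-written routine above; the architecture, data, and initialization are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Training data for the toy problem: y = sin(pi x)
rng = np.random.default_rng(0)
X = np.linspace(-1.0, 1.0, 50)
Y = np.sin(np.pi * X)

H = 3                                     # three tanh hidden units (z1, z2, z3)

def unpack(p):
    W1, b1 = p[:H], p[H:2*H]              # input -> hidden weights and biases
    W2, b2 = p[2*H:3*H], p[3*H]           # hidden -> output weights and bias
    return W1, b1, W2, b2

def error_and_grad(p):
    """Sum-of-squares error and its gradient via backpropagation."""
    W1, b1, W2, b2 = unpack(p)
    A = np.outer(X, W1) + b1              # hidden activations, shape (N, H)
    Z = np.tanh(A)                         # hidden unit outputs
    Yhat = Z @ W2 + b2                     # network output
    r = Yhat - Y                           # residuals
    E = 0.5 * np.sum(r**2)
    dA = np.outer(r, W2) * (1.0 - Z**2)    # backprop through tanh
    grad = np.concatenate([X @ dA,                 # dE/dW1
                           dA.sum(axis=0),         # dE/db1
                           Z.T @ r,                # dE/dW2
                           [np.sum(r)]])           # dE/db2
    return E, grad

p0 = rng.normal(scale=0.5, size=3 * H + 1)
res = minimize(error_and_grad, p0, jac=True, method='CG')   # nonlinear conjugate gradient
print(res.fun, res.nit)                    # final error and number of CG iterations
```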
A closer look at convergence

[Figure: network output after different numbers of training epochs.]

Final NN approximation: a closer look

[Figure: hidden unit outputs z1, z2 and z3 as functions of the input x, showing how the three hidden units combine to form the final approximation of y = sin(πx).]

Conjugate gradient conclusions
- Exploitation of a reasonable assumption about the local quadratic nature of the error surface.
- Little additional computation beyond steepest descent.
- No Hessian computation required.
- No hand-tuning of the learning rate.
- Much faster rate of convergence.