Conjugate gradient training algorithm


1 Conjugate gradient training algorithm
So far: heuristic improvements to gradient descent (momentum), the steepest descent training algorithm. Can we do better?
Next: the conjugate gradient training algorithm. Overview, derivation, examples.

Steepest descent algorithm
Definitions: w(j) is the weight vector at step j, g(j) = ∇E[w(j)] is the gradient at step j, d(j) is the search direction at step j.
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Perform a line minimization along d(j), i.e. find η_j such that E(w(j) + η_j d(j)) ≤ E(w(j) + η d(j)) for all η.
3. Let w(j+1) = w(j) + η_j d(j).
4. Evaluate g(j+1).
5. Let d(j+1) = -g(j+1).
6. Let j = j + 1 and go to step 2.

Remember previous examples
[Figure: contour plots of the quadratic and non-quadratic error surfaces E(ω1, ω2) used in earlier lectures.]
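As a concrete illustration of the steepest descent algorithm above, here is a minimal Python/numpy sketch, assuming a small hand-picked quadratic error surface E(w) = E_0 + b^T w + (1/2) w^T H w; the matrices H and b and the function names are illustrative, not taken from the slides. For a quadratic surface the exact line minimization has the closed form η = -(g^T d) / (d^T H d), which the sketch uses in place of a numerical 1-D search.

```python
import numpy as np

# Illustrative quadratic error surface E(w) = E0 + b^T w + 0.5 w^T H w (not from the slides).
H = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive-definite "Hessian"
b = np.array([-1.0, 1.0])
E0 = 0.0

def E(w):      return E0 + b @ w + 0.5 * w @ H @ w
def grad_E(w): return b + H @ w            # g(w) = b + H w

w = np.array([2.0, 2.0])                   # step 1: initial weight vector w(1)
for j in range(100):
    g = grad_E(w)
    d = -g                                 # steepest-descent direction d(j) = -g(j)
    eta = -(g @ d) / (d @ H @ d)           # step 2: exact line minimization (quadratic case)
    w = w + eta * d                        # step 3: weight update
    if np.linalg.norm(grad_E(w)) < 1e-10:  # step 4: evaluate the new gradient
        break

print("converged after", j + 1, "steps; w =", w, "; exact minimum =", np.linalg.solve(H, -b))
```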

2 Steepest descent algorithm examples
[Figure: steepest descent trajectories on the quadratic error surface and on the non-quadratic surface E(ω1, ω2); in both cases the path zig-zags and many steps are needed to converge.]

Conjugate gradient algorithm (a sneak peek)
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Perform a line minimization along d(j), i.e. find η_j such that E(w(j) + η_j d(j)) ≤ E(w(j) + η d(j)) for all η.
3. Let w(j+1) = w(j) + η_j d(j).
4. Evaluate g(j+1).
5. Let d(j+1) = -g(j+1) + β_j d(j), where β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j)).
6. Let j = j + 1 and go to step 2.
[Figure: conjugate gradient trajectories on the same quadratic and non-quadratic surfaces; convergence takes far fewer steps.]
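A quick numerical preview of the difference, on an illustrative ill-conditioned 2-D quadratic (not the slides' example): both loops below use the exact quadratic line-minimization step, and the conjugate gradient branch uses the β_j from step 5 above. On a W-dimensional quadratic, conjugate gradient finishes in W steps, while steepest descent keeps zig-zagging.

```python
import numpy as np

# Illustrative 2-D quadratic with minimum at the origin (not the slides' example).
H = np.array([[4.0, 0.0], [0.0, 1.0]])
b = np.zeros(2)
grad = lambda w: b + H @ w

def run(use_cg, w0=(1.0, 10.0), tol=1e-10, max_iter=200):
    w = np.array(w0)
    g = grad(w)
    d = -g
    for it in range(1, max_iter + 1):
        eta = -(g @ d) / (d @ H @ d)                 # exact line minimization on a quadratic
        w = w + eta * d
        g_new = grad(w)
        if np.linalg.norm(g_new) < tol:
            return it
        if use_cg:
            beta = (g_new @ (g_new - g)) / (g @ g)   # Polak-Ribiere beta
            d = -g_new + beta * d
        else:
            d = -g_new                               # steepest descent
        g = g_new
    return max_iter

print("steepest descent:", run(False), "steps;  conjugate gradient:", run(True), "steps")
```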

3 SD vs CG
In steepest descent, the line minimization makes the new gradient orthogonal to the old search direction: g[w(t+1)]^T d(t) = 0. (What does this mean?)
Key question: why/how does conjugate gradient improve on this?
Key difference: a new choice of search direction. Very little additional computation (over steepest descent). No more oscillation back and forth. Exploits knowledge of the local quadratic properties of the error surface.

Conjugate gradients: a first look
Non-interfering directions: g[w(t+1) + η d(t+1)]^T d(t) = 0 for all η. (What the #$@!# does this mean?)
How do we achieve non-interfering directions?
FACT: g[w(t+1) + η d(t+1)]^T d(t) = 0 for all η implies d(t+1)^T H d(t) = 0 (H-orthogonality, conjugacy).
Hmmm... we need to pay attention to 2nd-order properties of the error surface:
E(w) ≈ E_0 + b^T w + (1/2) w^T H w
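Both conditions can be checked numerically. A small sketch, again assuming a made-up quadratic surface: after an exact line minimization the new gradient is indeed orthogonal to the old steepest-descent direction, but consecutive steepest-descent directions are not H-orthogonal, which is exactly what conjugate gradients will fix.

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])     # illustrative quadratic: g(w) = b + H w
b = np.array([-1.0, 1.0])
grad_E = lambda w: b + H @ w

w = np.array([2.0, 2.0])
g = grad_E(w)
d_old = -g
eta = -(g @ d_old) / (d_old @ H @ d_old)   # exact line minimization along d_old
w_new = w + eta * d_old
g_new = grad_E(w_new)
d_new = -g_new                             # next steepest-descent direction

print("g[w(t+1)]^T d(t) =", g_new @ d_old)      # ~0: line minimization stops where the
                                                # new gradient is orthogonal to d(t)
print("d(t+1)^T H d(t)  =", d_new @ H @ d_old)  # generally NOT 0: steepest-descent
                                                # directions are not H-orthogonal
```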

4 Show the H-orthogonality requirement
Approximate g(w) about w(t+1) by a 1st-order Taylor approximation:
g(w) ≈ g[w(t+1)] + H [w - w(t+1)]
Evaluate g(w) at w = w(t+1) + η d(t+1):
g[w(t+1) + η d(t+1)] ≈ g[w(t+1)] + η H d(t+1)
Post-multiply (the transpose) by d(t).
Left-hand side: g[w(t+1) + η d(t+1)]^T d(t) = 0 (assumption of non-interference).
Right-hand side: g[w(t+1)]^T d(t) + η d(t+1)^T H d(t); the first term is 0 (implication of line minimization), so η d(t+1)^T H d(t) = 0 for all η, and therefore d(t+1)^T H d(t) = 0.
Proven fact: g[w(t+1) + η d(t+1)]^T d(t) = 0 for all η implies d(t+1)^T H d(t) = 0 (H-orthogonality).

Derivation of conjugate gradient algorithm
Local quadratic assumption: E(w) ≈ E_0 + b^T w + (1/2) w^T H w.
Assume W mutually conjugate vectors d(j). Key: we need to construct consecutive search directions d(j) that are conjugate (H-orthogonal)!
(Side note: what is the implicit assumption of SD?)
Question: starting from the initial weight vector w(1), how do we converge to the minimum w* in (at most) W steps?

5 Step-wise optimization
Expand the difference between the minimum and the initial weight vector in terms of the conjugate directions:
w* - w(1) = Σ_{i=1}^{W} α_i d(i)   (why can I do this?)
so that w(j) = w(1) + Σ_{i=1}^{j-1} α_i d(i) and w(j+1) = w(j) + α_j d(j).

Linear independence of conjugate directions
Theorem: for a positive-definite square matrix H, H-orthogonal vectors {d(1), d(2), ..., d(k)} are linearly independent.
Proof. Linear independence means that α_1 d(1) + α_2 d(2) + ... + α_k d(k) = 0 iff α_i = 0 for all i.
Pre-multiply by d(i)^T H:
α_1 d(i)^T H d(1) + α_2 d(i)^T H d(2) + ... + α_k d(i)^T H d(k) = 0
Note (by assumption): d(i)^T H d(j) = 0 for i ≠ j, so this reduces to α_i d(i)^T H d(i) = 0.
However, d(i)^T H d(i) > 0 because H is positive definite. Therefore α_i = 0 for all i ∈ {1, 2, ..., k}.

6 Linear independence of conjugate directions
From linear independence: W H-orthogonal vectors form a complete basis set, so any vector v can be expressed as v = Σ_{i=1}^{W} α_i d(i).
So, why did we need this result?

Step-wise optimization
w* - w(1) = Σ_{i=1}^{W} α_i d(i)
w(j) = w(1) + Σ_{i=1}^{j-1} α_i d(i),   w(j+1) = w(j) + α_j d(j)   (Ah-ha!)

So where are we now?
On a locally quadratic surface we can converge to the minimum in, at most, W steps using w(j+1) = w(j) + α_j d(j), j ∈ {1, 2, ..., W}.
Big questions: How to choose the step size α_j? How to construct conjugate directions? How can we do everything without computing H?

Computing the correct step size
Given a set of W conjugate vectors d(i), start from w* - w(1) = Σ_{i=1}^{W} α_i d(i) and pre-multiply by d(j)^T H:
d(j)^T H (w* - w(1)) = Σ_{i=1}^{W} α_i d(j)^T H d(i)

7 Computing the correct step size
By H-orthogonality (conjugacy), all terms with i ≠ j vanish (why?):
d(j)^T H (w* - w(1)) = α_j d(j)^T H d(j),   so   α_j = d(j)^T H (w* - w(1)) / (d(j)^T H d(j))
Also, from E(w) = E_0 + b^T w + (1/2) w^T H w we have g(w) = b + H w, and at the minimum g(w*) = b + H w* = 0, i.e. H w* = -b. Substituting:
α_j = d(j)^T (-b - H w(1)) / (d(j)^T H d(j))   (what's the problem?)
Now use w(j) = w(1) + Σ_{i=1}^{j-1} α_i d(i) and pre-multiply by d(j)^T H: the cross terms d(j)^T H d(i), i < j, all vanish, so d(j)^T H w(j) = d(j)^T H w(1). Therefore
α_j = -d(j)^T (b + H w(j)) / (d(j)^T H d(j))

8 Computing the correct step size
Since g(j) = b + H w(j), the step size becomes
α_j = -d(j)^T g(j) / (d(j)^T H d(j))   (woo-hoo!)

Important consequence
Theorem: assuming a W-dimensional quadratic error surface E(w) = E_0 + b^T w + (1/2) w^T H w and H-orthogonal vectors d(i), i ∈ {1, 2, ..., W}, the update
w(j+1) = w(j) + α_j d(j),   α_j = -d(j)^T g(j) / (d(j)^T H d(j))
will converge in at most W steps to the minimum w*. (For what error surface? Why is this so?)

Orthogonality of gradient to previous search directions
FACT: g(j+1)^T d(k) = (b + H w(j+1))^T d(k) = 0 for all k ≤ j.
How is this important? How is this different from steepest descent? Let's show that this is true. Starting from w(j+1) = w(j) + α_j d(j) and applying H:
H (w(j+1) - w(j)) = α_j H d(j),   i.e.   g(j+1) - g(j) = α_j H d(j)
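Before continuing with the proof of the orthogonality fact, a numerical sanity check of the W-step theorem above, under stated assumptions: H and b below are an arbitrary positive-definite quadratic (not from the slides), and the W conjugate directions are taken to be the eigenvectors of H, which are mutually H-orthogonal because H is symmetric.

```python
import numpy as np

# Illustrative W = 3 quadratic surface: g(w) = b + H w, minimum at H w* = -b.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
H = A @ A.T + 3 * np.eye(3)            # positive-definite Hessian
b = rng.standard_normal(3)
grad = lambda w: b + H @ w

# Eigenvectors of a symmetric H are mutually H-orthogonal, so they serve as a
# ready-made set of W conjugate directions for this check.
_, V = np.linalg.eigh(H)
directions = [V[:, i] for i in range(3)]

w = rng.standard_normal(3)             # arbitrary initial weight vector w(1)
for d in directions:
    g = grad(w)
    alpha = -(d @ g) / (d @ H @ d)     # alpha_j = -d(j)^T g(j) / (d(j)^T H d(j))
    w = w + alpha * d

print("after W = 3 steps:", w)
print("true minimum     :", np.linalg.solve(H, -b))
```

The two printed vectors agree to machine precision, regardless of the initial w(1) or of the order in which the conjugate directions are used.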

9 Orthogonality of gradient to previous search directions
Pre-multiply g(j+1) - g(j) = α_j H d(j) by d(j)^T:
d(j)^T (g(j+1) - g(j)) = α_j d(j)^T H d(j) = -d(j)^T g(j)
(substituting α_j = -d(j)^T g(j) / (d(j)^T H d(j))), so d(j)^T g(j+1) = 0.
We still need to show this for all k < j. Pre-multiply g(j+1) - g(j) = α_j H d(j) by d(k)^T, k < j:
d(k)^T (g(j+1) - g(j)) = α_j d(k)^T H d(j) = 0   (why?)
so d(k)^T g(j+1) = d(k)^T g(j) for k < j. By induction, d(k)^T g(j+1) = 0 for all k ≤ j.
For example: d(j-1)^T g(j+1) = d(j-1)^T g(j) = 0, d(j-2)^T g(j+1) = d(j-2)^T g(j) = d(j-2)^T g(j-1) = 0, and so on.
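The same kind of check also illustrates the FACT just proven: after step j, the new gradient is orthogonal to every previous search direction, not only the most recent one (the quadratic and the eigenvector-based conjugate directions below are again illustrative).

```python
import numpy as np

A = np.array([[2.0, 0.5, 0.0], [0.5, 1.5, 0.3], [0.0, 0.3, 1.0]])
H = A @ A.T + np.eye(3)                 # illustrative positive-definite Hessian
b = np.array([1.0, -2.0, 0.5])
grad = lambda w: b + H @ w

_, V = np.linalg.eigh(H)                # eigenvectors: a mutually conjugate direction set
dirs = [V[:, i] for i in range(3)]

w = np.array([1.0, 1.0, 1.0])
for j, d in enumerate(dirs):
    g = grad(w)
    w = w + (-(d @ g) / (d @ H @ d)) * d
    g_new = grad(w)
    # g(j+1)^T d(k) for every k <= j: all entries are ~0
    print([round(float(g_new @ dirs[k]), 10) for k in range(j + 1)])
```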

10 So where are we now?
On a locally quadratic surface we can converge to the minimum in, at most, W steps using
w(j+1) = w(j) + α_j d(j),   α_j = -d(j)^T g(j) / (d(j)^T H d(j)),   j ∈ {1, 2, ..., W}.
Remaining big questions: How to construct conjugate directions? How can we do everything without computing H?

Theorem: let d(1) = -g(1), let g(j) be the gradient at step j, and let d(j+1) be defined as follows:
d(j+1) = -g(j+1) + β_j d(j),   β_j = g(j+1)^T H d(j) / (d(j)^T H d(j))
This construction generates W mutually H-orthogonal vectors, such that d(i)^T H d(j) = 0 for i ≠ j.

Proof, Part I
First goal: show that d(j+1)^T H d(j) = 0. Begin with d(j+1) = -g(j+1) + β_j d(j), transpose, and post-multiply by H d(j):
d(j+1)^T H d(j) = -g(j+1)^T H d(j) + β_j d(j)^T H d(j) = -g(j+1)^T H d(j) + g(j+1)^T H d(j) = 0
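The construction can already be exercised numerically (the rest of the proof follows on the next pages). A sketch on an illustrative quadratic, using the H-dependent form of β_j above; the H-free forms come later.

```python
import numpy as np

# Illustrative W = 4 quadratic: g(w) = b + H w (not from the slides).
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4 * np.eye(4)
b = rng.standard_normal(4)
grad = lambda w: b + H @ w

w = rng.standard_normal(4)
g = grad(w)
d = -g                                       # d(1) = -g(1)
D = [d]
for j in range(3):                           # build d(2), d(3), d(4)
    alpha = -(d @ g) / (d @ H @ d)           # exact step along d(j)
    w = w + alpha * d
    g_new = grad(w)
    beta = (g_new @ (H @ d)) / (d @ H @ d)   # beta_j = g(j+1)^T H d(j) / (d(j)^T H d(j))
    d = -g_new + beta * d                    # d(j+1) = -g(j+1) + beta_j d(j)
    D.append(d)
    g = g_new

# Mutual H-orthogonality: D^T H D should be diagonal (off-diagonal entries ~0).
Dmat = np.column_stack(D)
print(np.round(Dmat.T @ H @ Dmat, 8))
```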

11 Proof, Part II
Second goal: show that d(j+1)^T H d(i) = 0 for i < j.
Remember that w(i+1) = w(i) + α_i d(i) and g(w) = b + H w, so
g(i+1) - g(i) = H (w(i+1) - w(i)) = α_i H d(i),   i.e.   H d(i) = (1/α_i) (g(i+1) - g(i))   (from before).
Begin with d(j+1) = -g(j+1) + β_j d(j), transpose, and post-multiply by H d(i) for i < j:
d(j+1)^T H d(i) = -g(j+1)^T H d(i) + β_j d(j)^T H d(i)
The second term is 0 by assumption (the previously constructed directions are conjugate) (why?). For the first term, substitute H d(i):
g(j+1)^T H d(i) = (1/α_i) g(j+1)^T (g(i+1) - g(i)) = (1/α_i) (g(j+1)^T g(i+1) - g(j+1)^T g(i)),   i < j

12 Proof, Part II (continued)
So d(j+1)^T H d(i) vanishes for i < j provided g(j+1)^T g(k) = 0 for k ≤ j; once we show that, d(j+1)^T H d(i) = 0, i < j, is proven.
From the construction, d(1) = -g(1) and d(k) = -g(k) + β_{k-1} d(k-1), so each direction is a linear combination of the current gradient and all previous gradients:
d(k) = -g(k) + Σ_{l=1}^{k-1} γ_l g(l)   (why?)
For example:
d(1) = -g(1)
d(2) = -g(2) + β_1 d(1) = -g(2) - β_1 g(1)
d(3) = -g(3) + β_2 d(2) = -g(3) - β_2 g(2) - β_2 β_1 g(1)
Transpose and post-multiply by g(j+1), for k ≤ j:
d(k)^T g(j+1) = -g(k)^T g(j+1) + Σ_{l=1}^{k-1} γ_l g(l)^T g(j+1)
and d(k)^T g(j+1) = 0 for k ≤ j (because the gradient is orthogonal to all previous search directions).

13 Proof, Part II (continued)
Since d(k)^T g(j+1) = 0 for k ≤ j, we get
g(k)^T g(j+1) = Σ_{l=1}^{k-1} γ_l g(l)^T g(j+1),   k ≤ j
Since d(1) = -g(1), the base case is g(1)^T g(j+1) = -d(1)^T g(j+1) = 0 for all j ≥ 1 (why?). By induction on k, g(k)^T g(j+1) = 0 for all k ≤ j. For example:
g(1)^T g(2) = 0,   g(1)^T g(3) = 0,   g(1)^T g(4) = 0   (why?)
g(2)^T g(3) = γ_1 g(1)^T g(3) = 0,   g(2)^T g(4) = γ_1 g(1)^T g(4) = 0   (why?)
g(3)^T g(4) = γ_1 g(1)^T g(4) + γ_2 g(2)^T g(4) = 0   (why?)
Thus g(j+1)^T g(k) = 0 for k ≤ j.

Where we are:
g(j+1)^T g(k) = 0, k ≤ j
d(j+1)^T H d(j) = 0   (Part I)
d(j+1)^T H d(i) = -(1/α_i) (g(j+1)^T g(i+1) - g(j+1)^T g(i)) = 0, i < j   (Part II)
Does this show what we want, d(i)^T H d(j) = 0 for all i ≠ j? Kinda.

14 The home stretch
d(j+1)^T H d(j) = 0 (Part I) and d(j+1)^T H d(i) = 0, i < j (Part II). By induction:
d(2)^T H d(1) = 0   (Part I)
d(3)^T H d(1) = 0   (Part II),   d(3)^T H d(2) = 0   (Part I)
d(4)^T H d(1) = 0   (Part II),   d(4)^T H d(2) = 0   (Part II),   d(4)^T H d(3) = 0   (Part I)
Etc., etc., etc.

So where are we now?
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Update the weight vector: w(j+1) = w(j) + α_j d(j), α_j = -d(j)^T g(j) / (d(j)^T H d(j)), j ∈ {1, 2, ..., W}.
3. Evaluate g(j+1).
4. Let d(j+1) = -g(j+1) + β_j d(j), where β_j = g(j+1)^T H d(j) / (d(j)^T H d(j)).
5. Let j = j + 1 and go to step 2.
What's the problem? Both α_j and β_j still require the Hessian H.

Computing without H
Remaining big question: how can we do everything without computing H?
From earlier: H d(j) = (1/α_j) (g(j+1) - g(j)). Two areas to fix: the step size α_j and the coefficient β_j. For β_j, substitute:
β_j = g(j+1)^T H d(j) / (d(j)^T H d(j)) = g(j+1)^T (g(j+1) - g(j)) / (d(j)^T (g(j+1) - g(j)))   (Hestenes-Stiefel)

15 Computing without H
Simplify the denominator. Since d(j) = -g(j) + β_{j-1} d(j-1), transpose and post-multiply by g(j+1) and by g(j):
d(j)^T g(j+1) = 0   and   d(j)^T g(j) = -g(j)^T g(j)   (because d(j-1)^T g(j) = 0)
so d(j)^T (g(j+1) - g(j)) = g(j)^T g(j), giving
β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))   (Polak-Ribiere)
Finally, since g(j+1)^T g(k) = 0 for k ≤ j,
β_j = g(j+1)^T g(j+1) / (g(j)^T g(j))   (Fletcher-Reeves)
Three choices:
β_j = g(j+1)^T (g(j+1) - g(j)) / (d(j)^T (g(j+1) - g(j)))   (Hestenes-Stiefel)
β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))   (Polak-Ribiere)
β_j = g(j+1)^T g(j+1) / (g(j)^T g(j))   (Fletcher-Reeves)
Which is best?
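Written as code, the three H-free choices look like this; g and g_new stand for g(j) and g(j+1), d for d(j), and the function names are mine, not from the slides. On an exactly quadratic surface with exact line minimization all three coincide; on real, non-quadratic error surfaces they differ, and the complete algorithm on the next pages uses Polak-Ribiere.

```python
def beta_hestenes_stiefel(g, g_new, d):
    # beta_j = g(j+1)^T (g(j+1) - g(j)) / (d(j)^T (g(j+1) - g(j)))
    return (g_new @ (g_new - g)) / (d @ (g_new - g))

def beta_polak_ribiere(g, g_new):
    # beta_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))
    return (g_new @ (g_new - g)) / (g @ g)

def beta_fletcher_reeves(g, g_new):
    # beta_j = g(j+1)^T g(j+1) / (g(j)^T g(j))
    return (g_new @ g_new) / (g @ g)
```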

16 Computing without H
Key: replace the explicit step size α_j (which requires H) with a line minimization. Under the local quadratic assumption E(w) = E_0 + b^T w + (1/2) w^T H w,
E(w(j) + α d(j)) = E_0 + b^T (w(j) + α d(j)) + (1/2) (w(j) + α d(j))^T H (w(j) + α d(j))
Differentiating with respect to α (and using the symmetry of H):
dE/dα = b^T d(j) + d(j)^T H w(j) + α d(j)^T H d(j)
Setting dE/dα = 0 gives
α = -(b + H w(j))^T d(j) / (d(j)^T H d(j)) = -d(j)^T g(j) / (d(j)^T H d(j))
which is exactly the step size α_j derived earlier.
Conclusion: a line minimization along d(j) computes the correct α_j without any Hessian computation.
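A quick numerical confirmation of the conclusion: on a quadratic surface, a generic 1-D line minimization (here scipy's minimize_scalar, used purely as an illustration) recovers the same α as the closed-form expression that needed H.

```python
import numpy as np
from scipy.optimize import minimize_scalar

H = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative quadratic from the earlier sketches
b = np.array([-1.0, 1.0])
E = lambda w: b @ w + 0.5 * w @ H @ w

w = np.array([2.0, 2.0])
g = b + H @ w
d = -g
line = minimize_scalar(lambda a: E(w + a * d))    # 1-D minimization: no Hessian needed

print("line-search alpha :", line.x)
print("closed-form alpha :", -(g @ d) / (d @ H @ d))
```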

17 Complete conjugate gradient algorithm
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Perform a line minimization along d(j), i.e. find α_j such that E(w(j) + α_j d(j)) ≤ E(w(j) + α d(j)) for all α.
3. Let w(j+1) = w(j) + α_j d(j).
4. Evaluate g(j+1).
5. Let d(j+1) = -g(j+1) + β_j d(j), where β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))   (Polak-Ribiere).
6. Let j = j + 1 and go to step 2.

Comments
Exploitation of a reasonable assumption about the local quadratic nature of the error surface.
Little additional computation beyond steepest descent.
No Hessian computation required.
No hand-tuning of the learning rate.
In practice, the conjugate gradient algorithm must be reset every W steps. (Why? What about violations of the H > 0 assumption?)

Quadratic and non-quadratic examples
[Figure: steepest descent vs. conjugate gradient on the quadratic error surface and on the non-quadratic surface E(ω1, ω2); conjugate gradient converges in far fewer steps.]
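A minimal sketch of the complete algorithm in Python, assuming a numerical 1-D line search for step 2 and the common practice of restarting with d = -g every W steps; the helper names and the small test quadratic are illustrative, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(E, grad_E, w0, max_iter=200, tol=1e-8):
    """Sketch of the complete algorithm: line minimization along d(j),
    Polak-Ribiere beta, and a restart (d = -g) every W steps."""
    w = np.asarray(w0, dtype=float)
    W = w.size
    g = grad_E(w)
    d = -g                                              # step 1: d(1) = -g(1)
    for j in range(max_iter):
        alpha = minimize_scalar(lambda a: E(w + a * d)).x   # step 2: line minimization
        w = w + alpha * d                                   # step 3
        g_new = grad_E(w)                                   # step 4
        if np.linalg.norm(g_new) < tol:
            break
        if (j + 1) % W == 0:
            d = -g_new                                      # periodic reset every W steps
        else:
            beta = (g_new @ (g_new - g)) / (g @ g)          # step 5: Polak-Ribiere
            d = -g_new + beta * d
        g = g_new                                           # step 6: next iteration
    return w

# Usage on a small illustrative quadratic:
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
w_min = conjugate_gradient(lambda w: b @ w + 0.5 * w @ H @ w,
                           lambda w: b + H @ w,
                           [2.0, 2.0])
print(w_min, "vs", np.linalg.solve(H, -b))
```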

18 Nonquadratic example
[Figure: steepest descent vs. conjugate gradient trajectories on the non-quadratic surface E(ω1, ω2) for particular initial weights (why these initial weights?); the region where H > 0 is contrasted with the region where H < 0.]

Simple NN training example
[Figure: a small feedforward network with hidden-unit outputs z trained to fit a sinusoid y = sin(πx); steepest descent, conjugate gradient, and plain gradient descent are compared by the number of steps to convergence.]
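As a sketch of the kind of simple NN training example shown in the figures, the sum-of-squares error of a tiny 1-3-1 tanh network can be handed to an off-the-shelf nonlinear conjugate gradient routine (scipy's method='CG', a Polak-Ribiere variant). The network size, the data, and the target y = sin(πx) below are assumptions for illustration, not the slides' exact setup.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative 1-3-1 tanh network fit to a sinusoid (assumed setup, not the slides' exact one).
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(np.pi * x)

def unpack(w):
    W1, b1 = w[0:3], w[3:6]          # input -> 3 hidden units
    W2, b2 = w[6:9], w[9]            # 3 hidden units -> output
    return W1, b1, W2, b2

def predict(w, x):
    W1, b1, W2, b2 = unpack(w)
    z = np.tanh(np.outer(x, W1) + b1)    # hidden-unit outputs z1, z2, z3
    return z @ W2 + b2

def error(w):                            # sum-of-squares error E
    return 0.5 * np.sum((predict(w, x) - y) ** 2)

w0 = 0.5 * rng.standard_normal(10)
result = minimize(error, w0, method='CG')    # nonlinear conjugate gradient
print("final E =", result.fun, "after", result.nit, "iterations")
```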

19 Simple NN training example: error convergence
[Figure: log(E/N) versus number of training epochs for the gradient descent, steepest descent, and conjugate gradient algorithms.]

A closer look at convergence
[Figure: the network's approximation of the target function at several intermediate epochs of training.]

20 A closer look at convergence
[Figure: the conjugate gradient network's output at several intermediate epochs, approaching the target function.]

Final NN approximation: a closer look
[Figure: the hidden-unit outputs z1, z2, and z3 and the final network approximation, plotted against the input x.]

Conjugate gradient conclusions
Exploitation of a reasonable assumption about the local quadratic nature of the error surface.
Little additional computation beyond steepest descent.
No Hessian computation required.
No hand-tuning of the learning rate.
Much faster rate of convergence.
