IV. Performance Optimization


Outline:

A. Steepest descent algorithm: definition; how to set bounds on the learning rate; minimization along a line (varying learning rate); momentum learning; examples
B. Newton's method: definition; Gauss-Newton method; Levenberg-Marquardt method
C. Conjugate gradient method: definition; conjugate direction theorem; method implementation; example

References: [Hagan], [Moon]

Performance Optimization

Goal: how do we find the optimum (minimum) points located on the performance (error) surface F(x)?

A neural network progressively trains (learns) as it is presented feature vectors, so learning is iterative. The optimization schemes investigated here are also iterative:

    x_{k+1} = x_k + α_k p_k

where α_k is the learning rate and p_k is the search direction.

Schemes investigated:
A. Steepest descent (with minimization along a line)
B. Newton's method (Gauss-Newton, Levenberg-Marquardt)
C. Conjugate gradient

A. Steepest Descent

Goal: find the search direction p_k for the update x_{k+1} = x_k + α_k p_k.

Use a Taylor series expansion to find p_k (stop at the first-order approximation):

    F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + ∇F(x_k)ᵀ Δx_k,   with Δx_k = x_{k+1} − x_k = α_k p_k.

For the one-dimensional case: F(x_{k+1}) ≈ F(x_k) + F′(x_k)(x_{k+1} − x_k).

Pick p_k so that F(x_{k+1}) < F(x_k), i.e., so that ∇F(x_k)ᵀ p_k < 0 (a descent direction). The steepest descent choice is

    p_k = −∇F(x_k).

For the quadratic F(x) = ½ xᵀAx + dᵀx + c:

    ∇F(x) = Ax + d,   ∇²F(x) = A.
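As a quick illustration of the update x_{k+1} = x_k − α∇F(x_k) on a quadratic surface, here is a minimal sketch in Python/NumPy. The matrix A, vector d, step size alpha, and starting point x0 are made-up values for illustration only.

import numpy as np

# Hypothetical quadratic: F(x) = 0.5*x'Ax + d'x + c, so grad F(x) = A@x + d
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # assumed positive definite Hessian
d = np.array([-1.0, 1.0])
alpha = 0.1                          # assumed fixed learning rate

def grad_F(x):
    return A @ x + d

x = np.array([2.0, -2.0])            # arbitrary starting point
for k in range(50):
    x = x - alpha * grad_F(x)        # steepest descent: p_k = -grad F(x_k)

print("steepest descent estimate:", x)
print("true minimum -A^{-1} d   :", -np.linalg.solve(A, d))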

Example: for F(x) = x₁² + 5x₂², find ∇F(x) and ∇²F(x), and derive the iterative steepest descent expression for x(k).

A.2 What is the effect of α on the behavior of the iterative scheme?

As α increases, the trajectory goes from overdamped behavior, to underdamped (oscillatory) behavior, to unstable behavior. [Figure: contour plots of the iteration trajectories for increasing α.]

A.3 How do we set up bounds on the learning rate α?

For the quadratic F(x) = ½ xᵀAx + dᵀx + c, ∇F(x) = Ax + d, so the steepest descent iteration becomes

    x_{k+1} = x_k − α(Ax_k + d) = (I − αA)x_k − αd.

Overdamped/Underdamped Behavior

Define c_k = x_k − x_opt, where x_opt = −A⁻¹d is the minimum. From

    x_{k+1} = (I − αA)x_k − αd = (I − αA)(c_k + x_opt) − αd

and Ax_opt + d = 0, the error obeys

    c_{k+1} = (I − αA)c_k.

With the eigendecomposition A = QΛQᵀ (Q orthogonal, Λ diagonal), let v_k = Qᵀc_k; then

    v_{k+1} = Qᵀ(I − αA)Q v_k = (I − αΛ)v_k,

so each eigen-component evolves independently as [v_{k+1}]_i = (1 − αλ_i)[v_k]_i.

Transforming back:

    x_k = c_k + x_opt = Q v_k + x_opt = Σ_i q_i [v_k]_i + x_opt = Σ_i q_i (1 − αλ_i)^k [v_0]_i + x_opt.

A component changes sign at every iteration if (1 − αλ_i) < 0, since (1 − αλ_i)^k then flips sign depending on whether k is even or odd. To ensure overdamped behavior, select α < 1/λ_i for all i, i.e., α < 1/λ_max. (Convergence itself only requires |1 − αλ_i| < 1 for all i, i.e., α < 2/λ_max.)
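A small numerical check of these bounds, assuming a made-up positive definite A: it compares the eigenvalue-based limits 1/λ_max (overdamped) and 2/λ_max (stability) against the observed behavior of the iteration for a few step sizes.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # assumed positive definite Hessian
d = np.array([0.0, 0.0])
x_opt = -np.linalg.solve(A, d)
lam_max = np.linalg.eigvalsh(A).max()
print("overdamped limit 1/lam_max:", 1.0 / lam_max)
print("stability limit  2/lam_max:", 2.0 / lam_max)

def run(alpha, iters=100):
    x = np.array([1.0, -1.5])            # arbitrary starting point
    for _ in range(iters):
        x = x - alpha * (A @ x + d)      # steepest descent on the quadratic
    return np.linalg.norm(x - x_opt)

for alpha in (0.2, 0.5, 0.7):            # below 1/3, between 1/3 and 2/3, above 2/3
    print(f"alpha={alpha}: final error {run(alpha):.3e}")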

Example: for a two-variable quadratic F(x) with a cross term x₁x₂, find the upper bound on the learning rate α. [Figure: iteration trajectories for α = 0.39 and α = 0.4.]

A.4 Minimization along a line

An alternative for choosing α_k: minimize F(x_k + α_k p_k) with respect to α_k at each iteration, with p_k = −∇F(x_k).

For an arbitrary function this is difficult, so look at the quadratic case first. Setting the derivative with respect to α_k to zero (chain rule):

    d/dα_k F(x_k + α_k p_k) = ∇F(x)ᵀ|_{x=x_k} p_k + α_k p_kᵀ ∇²F(x)|_{x=x_k} p_k = 0,

which gives

    α_k = − ∇F(x_k)ᵀ p_k / (p_kᵀ A_k p_k),   where A_k = ∇²F(x_k).

The update is then x_{k+1} = x_k + α_k p_k = x_k − α_k ∇F(x_k).

[Contour plot: steepest descent trajectory with exact line minimization.]

Recall: α_k is computed so that F(x_k + α_k p_k) is a minimum along the gradient line. Since F(x_k + α p_k) is minimized at x_{k+1}, we have dF/dα = ∇F(x_{k+1})ᵀ p_k = 0, so the gradient at x_{k+1} is orthogonal to the gradient at x_k.
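A short sketch of steepest descent with exact line minimization on a quadratic, using the formula α_k = −∇F(x_k)ᵀp_k / (p_kᵀA p_k); the A, d, and starting-point values are illustrative assumptions. It also prints the inner product of successive gradients, which should be numerically zero.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # assumed Hessian of the quadratic
d = np.array([1.0, -1.0])
grad = lambda x: A @ x + d

x = np.array([2.0, 2.0])                 # arbitrary starting point
g = grad(x)
for k in range(5):
    p = -g                               # steepest descent direction
    alpha = -(g @ p) / (p @ A @ p)       # exact minimization along the line
    x = x + alpha * p
    g_new = grad(x)
    print(f"k={k}: g_k . g_k+1 = {g @ g_new:+.2e}, |g_k+1| = {np.linalg.norm(g_new):.2e}")
    g = g_new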

Example: for a two-variable quadratic F(x) with a given initial point x_0, carry out iterations of steepest descent using minimization along a line; compute ∇F(x) and ∇²F(x).


Example: pattern recognition (2 classes). Steepest descent training of a multilayer NN with a 3-neuron hidden layer, fixed step size α. [Figure 4.7: pattern-recognition problem for a neural network; decision output (solid line) vs. NN output (dashed line).]

A.5 Momentum learning

The speed of convergence of steepest descent may improve if the oscillations in the iteration scheme are reduced. The oscillations may be viewed as high-frequency components, which can be smoothed out by a low-pass filter.

Basic steepest descent iteration: x_{k+1} = x_k + Δx_k, with Δx_k = −α∇F(x_k).

Modify it as follows:

    Δx_k = γ Δx_{k−1} − (1 − γ) α ∇F(x_k),   γ ∈ [0, 1].

γ = 0:  Δx_k = −α∇F(x_k)  (basic steepest descent).
γ = 1:  Δx_k = Δx_{k−1}  (no slope update).

Impact of momentum: when the current and previous derivatives have the same sign, momentum accelerates the iteration in that direction; when they have different signs, momentum provides a drag, which tends to reduce oscillations and stabilize the behavior.
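A minimal sketch of the momentum update Δx_k = γΔx_{k−1} − (1−γ)α∇F(x_k) on an assumed ill-conditioned quadratic; γ, α, A, and the starting point are illustrative choices, not values from the slides. It prints the final distance to the minimum with and without momentum.

import numpy as np

A = np.diag([1.0, 25.0])                 # assumed elongated (ill-conditioned) quadratic
grad = lambda x: A @ x                   # F(x) = 0.5 * x'Ax, minimum at the origin

def descend(gamma, alpha=0.04, iters=200):
    x = np.array([1.0, 1.0])             # arbitrary starting point
    dx = np.zeros(2)
    for _ in range(iters):
        dx = gamma * dx - (1.0 - gamma) * alpha * grad(x)   # momentum update
        x = x + dx
    return np.linalg.norm(x)             # distance to the minimum

print("no momentum (gamma=0.0):", descend(0.0))
print("momentum    (gamma=0.8):", descend(0.8))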

Why does momentum learning work? [Figure: trajectory with momentum on the error contour plot, and the corresponding error vs. iteration number.]

Effects of momentum learning and step size on the multilayer NN example above (α: step size; µ: momentum constant).

B. Newton's Method

Recall that the steepest descent scheme is based on the first-order expansion

    F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + ∇F(x_k)ᵀ Δx_k.

Newton's method extends the expansion to second order:

    F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + ∇F(x_k)ᵀ Δx_k + ½ Δx_kᵀ ∇²F(x_k) Δx_k.

Restricting attention to quadratic functions F(x) = ½ xᵀAx + dᵀx + c:

    ∇F(x) = Ax + d = 0 at the minimum  ⟹  x* = −A⁻¹d,   ∇²F(x) = A.

Find Δx_k so that F(x_k + Δx_k) is minimized:

    d/dΔx_k [ F(x_k) + ∇F(x_k)ᵀ Δx_k + ½ Δx_kᵀ ∇²F(x_k) Δx_k ] = ∇F(x_k) + ∇²F(x_k) Δx_k = 0

    ⟹ Δx_k = −[∇²F(x_k)]⁻¹ ∇F(x_k).

The iteration becomes

    x_{k+1} = x_k + Δx_k = x_k − [∇²F(x_k)]⁻¹ ∇F(x_k).

For a quadratic F(x) = ½ xᵀAx + dᵀx + c, ∇F(x_k) = Ax_k + d and ∇²F(x_k) = A, so the iteration reaches the minimum x* = −A⁻¹d in a single step.
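A compact sketch of the Newton iteration x_{k+1} = x_k − [∇²F(x_k)]⁻¹∇F(x_k), applied here to an assumed simple non-quadratic convex function so that more than one step is needed; the function and starting point are illustrative only.

import numpy as np

# Assumed non-quadratic test function: F(x) = x1^4 + x1^2 + x2^2 (minimum at the origin)
def grad(x):
    return np.array([4 * x[0]**3 + 2 * x[0], 2 * x[1]])

def hess(x):
    return np.array([[12 * x[0]**2 + 2, 0.0],
                     [0.0,              2.0]])

x = np.array([2.0, 3.0])                          # arbitrary starting point
for k in range(8):
    x = x - np.linalg.solve(hess(x), grad(x))     # Newton step: dx = -[H]^-1 grad
    print(f"k={k}: x = {x}, |grad| = {np.linalg.norm(grad(x)):.2e}")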

B.1 What happens when F(x) is not quadratic and Newton's method is used in the iterative scheme?

[Figures: true function vs. its local quadratic approximation, and one iteration of Newton's scheme from two different initial points x_0.]

Newton's method summary

Newton's method is based on a local approximation of F(x) by a quadratic function.
If F(x) is quadratic, it converges in one step.
If F(x) is not quadratic, it may converge to a local minimum or a saddle point, or it may oscillate.
Newton's method is expensive: the Hessian ∇²F(x_k) must be computed at each iteration, and the linear system ∇²F(x_k) Δx_k = −∇F(x_k) must be solved at each iteration.

B.2 Gauss-Newton Method

The Newton step x_{k+1} = x_k − A_k⁻¹ g_k, with A_k = ∇²F(x_k) and g_k = ∇F(x_k), is expensive because the Hessian must be computed; we therefore approximate it.

Assume F(x) is a sum of squares:

    F(x) = Σ_{i=1}^N v_i²(x) = vᵀ(x) v(x).

Then the j-th component of the gradient is

    [∇F(x)]_j = ∂F(x)/∂x_j = 2 Σ_{i=1}^N v_i(x) ∂v_i(x)/∂x_j.

In matrix form, ∇F(x) = 2 Jᵀ(x) v(x), where J(x) = [∂v_i(x)/∂x_j] is the N×n Jacobian matrix of v(x). For the Hessian,

    [∇²F(x)]_{k,j} = ∂²F(x)/(∂x_k ∂x_j) = 2 Σ_{i=1}^N [ ∂v_i(x)/∂x_k · ∂v_i(x)/∂x_j + v_i(x) ∂²v_i(x)/(∂x_k ∂x_j) ],

so

    ∇²F(x) = 2 Jᵀ(x) J(x) + 2 S(x),   S(x) = Σ_{i=1}^N v_i(x) ∇²v_i(x),

where S(x) involves the second-order derivative terms, which can be neglected. The Gauss-Newton iteration is therefore

    x_{k+1} = x_k − [Jᵀ(x_k) J(x_k)]⁻¹ Jᵀ(x_k) v(x_k).
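A small sketch of the Gauss-Newton update x_{k+1} = x_k − (JᵀJ)⁻¹Jᵀv for a least-squares problem. The residual model (fitting an assumed exponential y ≈ exp(a·t) + b to synthetic data) and all numerical values are illustrative assumptions, not taken from the slides.

import numpy as np

# Assumed least-squares problem: fit y ~= exp(a*t) + b to synthetic data.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.exp(0.5 * t) + 1.0                   # data generated with a=0.5, b=1 (assumed)

def residuals(p):                           # v(x): one residual per data point
    a, b = p
    return np.exp(a * t) + b - y

def jacobian(p):                            # J(x): dv_i / dx_j
    a, b = p
    return np.column_stack([t * np.exp(a * t), np.ones_like(t)])

p = np.array([0.3, 0.5])                    # arbitrary starting guess
for k in range(10):
    J, v = jacobian(p), residuals(p)
    p = p - np.linalg.solve(J.T @ J, J.T @ v)   # Gauss-Newton step
print("estimated (a, b):", p)               # should approach (0.5, 1.0)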

B.3 Levenberg-Marquardt Scheme

Gauss-Newton:

    x_{k+1} = x_k − [Jᵀ(x_k) J(x_k)]⁻¹ Jᵀ(x_k) v(x_k).

To add robustness to the numerical implementation (JᵀJ may be singular or ill-conditioned), add a damping term:

    x_{k+1} = x_k − [Jᵀ(x_k) J(x_k) + µ_k I]⁻¹ Jᵀ(x_k) v(x_k).
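A sketch of the damped update with a simple µ adaptation rule (decrease µ after a successful step, increase it otherwise); the rule, the factor of 10, and the data-fitting problem reuse the assumptions of the Gauss-Newton sketch above and are not taken from the slides.

import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.exp(0.5 * t) + 1.0                        # synthetic data (assumed a=0.5, b=1)

def residuals(p):
    a, b = p
    return np.exp(a * t) + b - y

def jacobian(p):
    a, b = p
    return np.column_stack([t * np.exp(a * t), np.ones_like(t)])

p, mu = np.array([0.3, 0.5]), 1e-2               # starting guess and initial damping (assumed)
for k in range(30):
    J, v = jacobian(p), residuals(p)
    step = np.linalg.solve(J.T @ J + mu * np.eye(2), J.T @ v)   # Levenberg-Marquardt step
    p_new = p - step
    if np.sum(residuals(p_new)**2) < np.sum(v**2):
        p, mu = p_new, mu / 10.0                 # success: accept step, reduce damping
    else:
        mu = mu * 10.0                           # failure: increase damping and retry
print("estimated (a, b):", p)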

C. Conjugate Gradient Method (CG)

Using second-order information is often too expensive, so go back to a first-order approach. For small problems, CG is less efficient than the Newton scheme; for large problems, CG is a leading contender.

Notes:
1) Assume F(x) = ½ xᵀAx + dᵀx + c and that we want to find the minimum of F(x).
2) Definition: mutually conjugate vectors with respect to a matrix A (A-orthogonal). A set of vectors {p_k} is mutually conjugate with respect to a positive definite Hessian matrix A iff

    p_kᵀ A p_j = 0   for k ≠ j.

Consequence: the eigenvectors of A are A-conjugate.

3) If the vectors {p_k}, k = 1, ..., K, are non-zero and A-conjugate for a positive definite matrix A, then the vectors {p_k} are linearly independent.

Consequences:
1) One can minimize a quadratic by searching along the eigenvectors, since they are the principal axes of the ellipsoidal contours (however, finding the eigenvectors requires the Hessian, which is expensive).
2) With a set of exact line searches along a set of conjugate vectors, the minimum can be reached in at most n steps; the problem comes down to computing conjugate directions.

C.1 Conjugate Direction Theorem

Assume F(x) = ½ xᵀAx + dᵀx + c with x ∈ ℝⁿ, and let {p_0, ..., p_{n−1}} be a set of A-conjugate vectors. For any initial condition x_0, the iteration

    x_{k+1} = x_k + α_k p_k,   α_k = − g_kᵀ p_k / (p_kᵀ A p_k),   g_k = ∇F(x_k) = A x_k + d,

converges to the unique minimum x* of F(x) in at most n steps.

Proof:

Since the {p_j} are A-conjugate, they are linearly independent, so the initial error can be expanded on them:

    x* − x_0 = Σ_{j=0}^{n−1} α_j* p_j.

Multiplying by p_kᵀ A and using A-conjugacy (all cross terms vanish):

    p_kᵀ A (x* − x_0) = α_k* p_kᵀ A p_k   ⟹   α_k* = p_kᵀ A (x* − x_0) / (p_kᵀ A p_k).

The iteration gives x_k − x_0 = Σ_{j=0}^{k−1} α_j p_j, so p_kᵀ A (x_k − x_0) = 0, and therefore

    p_kᵀ A (x* − x_0) = p_kᵀ A (x* − x_k) = p_kᵀ [ (A x* + d) − (A x_k + d) ] = p_kᵀ [ ∇F(x*) − ∇F(x_k) ] = − p_kᵀ g_k,

since ∇F(x*) = 0. Hence

    α_k* = − p_kᵀ g_k / (p_kᵀ A p_k) = α_k,

i.e., the expansion coefficients coincide with the step sizes of the iteration, and after n steps x_n = x_0 + Σ_{j=0}^{n−1} α_j p_j = x*.
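A quick numerical check of the theorem, using the eigenvectors of an assumed positive definite A as the conjugate directions (they are A-conjugate, as noted earlier); after n = 2 steps the iterate should match the exact minimum −A⁻¹d. All values are illustrative.

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # assumed positive definite Hessian
d = np.array([1.0, -2.0])
x_star = -np.linalg.solve(A, d)          # exact minimum of 0.5 x'Ax + d'x + c

_, Q = np.linalg.eigh(A)                 # eigenvectors: one valid set of A-conjugate directions
x = np.array([4.0, -4.0])                # arbitrary starting point
for k in range(2):                       # n = 2 steps
    p = Q[:, k]
    g = A @ x + d                        # g_k = grad F(x_k)
    alpha = -(g @ p) / (p @ A @ p)       # exact line search along p_k
    x = x + alpha * p

print("after n steps:", x)
print("true minimum :", x_star)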

C.2 CG Method Implementation

CG requires knowledge of the conjugate direction vectors p_k; usually the p_k are computed as the method progresses (not beforehand).

Recall x_{k+1} = x_k + α_k p_k, with α_k chosen to minimize F(x) along the direction p_k (so that g_{k+1}ᵀ p_k = 0). We need to select the p_k so that

    p_kᵀ A p_j = 0   for j ≠ k.

Note that

    g_{k+1} − g_k = (A x_{k+1} + d) − (A x_k + d) = A (x_{k+1} − x_k) = α_k A p_k ≜ Δg_k,

so the conjugacy condition can be rewritten without explicitly using A: we need to find p_{k+1} so that

    p_{k+1}ᵀ Δg_k = α_k p_{k+1}ᵀ A p_k = 0.

Iteration k = 0: take the steepest descent direction,

    p_0 = −g_0,   g_0 = ∇F(x_0) = A x_0 + d,
    x_1 = x_0 + α_0 p_0,   α_0 = − g_0ᵀ p_0 / (p_0ᵀ A p_0).

Iteration k ≥ 1: pick p_k so that p_kᵀ Δg_{k−1} = 0. Take

    p_k = −g_k + β_k p_{k−1},

with β_k chosen so that the conjugacy condition holds:

    (−g_k + β_k p_{k−1})ᵀ Δg_{k−1} = 0   ⟹   β_k = g_kᵀ Δg_{k−1} / (p_{k−1}ᵀ Δg_{k−1}).

Overall iteration scheme:

    x_{k+1} = x_k + α_k p_k,   α_k = − g_kᵀ p_k / (p_kᵀ A p_k),
    g_{k+1} = A x_{k+1} + d,
    p_{k+1} = −g_{k+1} + β_{k+1} p_k,   β_{k+1} = g_{k+1}ᵀ g_{k+1} / (g_kᵀ g_k)

(for exact line searches on a quadratic, this form of β is equivalent to the expression derived above).
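A minimal sketch of this scheme on an assumed 2-D quadratic; A, d, and the starting point are illustrative. On an n-dimensional quadratic the loop reaches the minimum in at most n iterations (here n = 2).

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])               # assumed positive definite Hessian
d = np.array([-1.0, 2.0])
x = np.array([2.0, 2.0])                 # arbitrary starting point

g = A @ x + d                            # g_0
p = -g                                   # p_0: steepest descent direction
for k in range(2):                       # n = 2 steps suffice for a 2-D quadratic
    alpha = -(g @ p) / (p @ A @ p)       # exact line search along p_k
    x = x + alpha * p
    g_new = A @ x + d
    beta = (g_new @ g_new) / (g @ g)     # beta_{k+1} = g_{k+1}'g_{k+1} / g_k'g_k
    p = -g_new + beta * p
    g = g_new

print("CG estimate :", x)
print("true minimum:", -np.linalg.solve(A, d))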

Example: implement the conjugate gradient scheme for a quadratic F(x) = ½ xᵀAx, starting from a given initial point x_0, and compare with steepest descent (see the contour plots below).

[Contour plots: conjugate gradient trajectory vs. steepest descent trajectory for the same quadratic.]