LECTURE 22: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION

15-382 COLLECTIVE INTELLIGENCE - S19
LECTURE 22: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION
TEACHER: GIANNI A. DI CARO

WHAT IF WE HAVE ONE SINGLE AGENT?

PSO leverages the presence of a swarm: the outcome is truly a collective behavior.
If left alone, each individual agent would behave like a hill-climber, moving in the direction of a local optimum and then having quite a hard time escaping it.
A single agent doesn't look impressive. How can a single agent be smarter?

A (FIRST) GENERAL APPROACH

Iteratively move from the current point: x_{k+1} = x_k + α_k p_k, where p_k is a search direction and α_k a step size.
What is a good direction p?
How small / large / constant / variable should the step size α_k be?
How do we check that we are at the minimum?
This could work for a local optimum, but what about finding the global optimum?

IF WE HAVE THE FUNCTION (AND ITS DERIVATIVES)

From the 1st-order Taylor series: f(x) ≈ f(x_0) + ∇f(x_0)^T (x − x_0)

This is the equation of the tangent plane to f in x_0: the gradient vector is orthogonal to the tangent plane, and the partial derivatives determine the slope of the plane.
In each point x, the gradient vector is orthogonal to the isocontours {x : f(x) = c}.
The gradient vector points in the direction of maximal change of the function at that point. The magnitude of the gradient is the (maximal) rate of change of the function.
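As a quick numerical check (with an illustrative function f(x, y) = x² + 3y², not one from the slides), the sketch below compares f near x_0 with its tangent-plane value and verifies that the gradient is orthogonal to a direction tangent to the isocontour:

```python
import numpy as np

# Illustrative function f(x, y) = x^2 + 3y^2 and its gradient
f = lambda p: p[0] ** 2 + 3.0 * p[1] ** 2
grad = lambda p: np.array([2.0 * p[0], 6.0 * p[1]])

x0 = np.array([1.0, 2.0])
dx = np.array([0.01, -0.02])                 # small displacement around x0

taylor = f(x0) + grad(x0) @ dx               # 1st-order (tangent-plane) value
print(f(x0 + dx), taylor)                    # close for small dx

# A direction tangent to the isocontour {f = f(x0)} is orthogonal to the gradient
g = grad(x0)
tangent = np.array([-g[1], g[0]])            # rotate the gradient by 90 degrees
print(g @ tangent)                           # 0.0
```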

GRADIENTS AND RATE OF CHANGE

The directional derivative of a function f: R^n → R is the rate of change of the function along a given direction v (where v is a unit-norm vector):

D_v f(x) = lim_{h→0} [ f(x + hv) − f(x) ] / h

E.g., for a function of two variables and a direction v = (v_x, v_y):

D_v f(x, y) = (∂f/∂x) v_x + (∂f/∂y) v_y

Partial derivatives are directional derivatives along the vectors of the canonical basis of R^n.

GRADIENTS AND RATE OF CHANGE

Theorem: Given a direction v and a differentiable function f, in each point x_0 of the function domain the following relation holds between the gradient vector and the directional derivative:

D_v f(x_0) = ∇f(x_0) · v = ||∇f(x_0)|| ||v|| cos θ,   where θ is the angle between ∇f(x_0) and v

Corollary 1: D_v f(x_0) ≤ ||∇f(x_0)||

Corollary 2: D_v f(x_0) = ||∇f(x_0)|| only when the gradient is parallel to the direction v (θ = 0): the directional derivative in a point reaches its maximum value when the direction v coincides with the gradient vector.

In each point, the gradient vector corresponds to the direction of maximal change of the function.
The norm of the gradient corresponds to the maximal rate of change at that point.
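A small numerical check of the theorem, again on an illustrative function: the finite-difference rate of change along a unit direction v matches ∇f(x_0)·v, and the maximum value ||∇f(x_0)|| is attained when v points along the gradient:

```python
import numpy as np

f = lambda p: p[0] ** 2 + 3.0 * p[1] ** 2        # illustrative function
grad = lambda p: np.array([2.0 * p[0], 6.0 * p[1]])

x0 = np.array([1.0, 2.0])
h = 1e-6

v = np.array([3.0, 4.0]) / 5.0                   # an arbitrary unit-norm direction
numeric = (f(x0 + h * v) - f(x0)) / h            # finite-difference directional derivative
print(numeric, grad(x0) @ v)                     # both ~ 10.8

g = grad(x0)
v_max = g / np.linalg.norm(g)                    # direction of the gradient
print(g @ v_max, np.linalg.norm(g))              # max value equals ||grad f(x0)||
```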

GRADIENT DESCENT / ASCENT

Move in the direction opposite to the gradient vector (minimization) or aligned with it (maximization): the gradient is the direction of maximal change of the function.

ONLY LOCAL OPTIMA

The final local optimum depends on where we start from (for non-convex functions).

MOTION IN A POTENTIAL FIELD

A gradient-descent run is analogous to the motion of a mass in a potential field towards the minimum-energy configuration: at each point the gradient defines the attraction force, while the step size α scales the force to define the next point.

GRADIENT DESCENT ALGORITHM (MIN)

1. Initialization
   (a) Definition of a starting point x_0
   (b) Definition of a tolerance parameter ε for convergence
   (c) Initialization of the iteration variable, k ← 0
2. Computation of a feasible direction for moving: d_k ← −∇f(x_k)
3. Definition of the feasible (max) length of the move: α_k ← argmin_{α>0} f(x_k + α d_k)
   (a 1-dimensional problem in α ∈ R: the farthest point up to which f(x_k + α d_k) keeps decreasing)
4. Move to the new point in the direction of gradient descent: x_{k+1} ← x_k + α_k d_k
5. Check for convergence:
   If ||α_k d_k|| < ε [or if ||∇f(x_{k+1})|| ≤ c·ε, where c > 0 is a small constant], i.e., the gradient becomes (approximately) zero:
   (a) Output: x* = x_{k+1}, f(x*) = f(x_{k+1})
   (b) Stop
   Otherwise, k ← k + 1, and go to Step 2
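To make the loop above concrete, here is a minimal Python sketch of the algorithm; the objective f, gradient grad_f, starting point, and tolerances are illustrative, and the exact 1-D minimization of Step 3 is replaced by a simple backtracking line search (one of the approximate methods mentioned on the next slide):

```python
import numpy as np

def gradient_descent(f, grad_f, x0, eps=1e-6, max_iter=10_000):
    """Minimal steepest-descent sketch (Steps 1-5 of the slide).

    Step 3 is approximated by backtracking instead of an exact
    1-D minimization of f(x_k + alpha * d_k).
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        d = -grad_f(x)                      # Step 2: steepest-descent direction
        if np.linalg.norm(d) < eps:         # Step 5: gradient ~ 0 -> stop
            break
        alpha = 1.0                         # Step 3: backtracking line search
        while f(x + alpha * d) > f(x) - 0.5 * alpha * np.dot(d, d):
            alpha *= 0.5
        x = x + alpha * d                   # Step 4: move to the new point
    return x

# Example: minimize an anisotropic quadratic f(x, y) = x^2 + 10 y^2
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(gradient_descent(f, grad_f, [3.0, -2.0]))   # ~ [0, 0]
```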

STEP BEHAVIOR

If the gradient in x_0 points towards the local minimum, then a line search determines a step size α_k that takes us directly to the minimum. However, this lucky case only happens for perfectly conditioned functions, or for a restricted set of points.

It might be expensive to solve a 1-D optimization problem at each step, so approximate line-search methods can be preferred.

In the general case, with an exact line search consecutive moving directions d_k are perpendicular to each other: α* minimizes f(x_k + α d_k), so df/dα must be zero at α*:

df/dα |_{α=α*} = ∇f(x_k + α* d_k)^T d_k = ∇f(x_{k+1})^T d_k = 0

but d_{k+1} = −∇f(x_{k+1}), hence d_{k+1} ⊥ d_k.

ILL-CONDITIONED PROBLEMS

[Figure: ill-conditioned problem vs. well-conditioned problem]

If the function is very anisotropic, the problem is said to be ill-conditioned: the gradient vector does not point towards the local minimum, resulting in a zig-zagging trajectory.

Ill-conditioning can be quantified by the ratio between the largest and smallest eigenvalues of the Hessian matrix (the matrix of the second partial derivatives).
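As a quick illustration (with made-up quadratics, not the ones plotted on the slide), the eigenvalue ratio of the Hessian can be computed directly; a ratio close to 1 indicates a well-conditioned problem, a large ratio an ill-conditioned one:

```python
import numpy as np

# Hessians of two quadratics f(x) = 0.5 * x^T H x (illustrative values)
H_well = np.array([[2.0, 0.0], [0.0, 2.0]])    # isotropic -> circular isocontours
H_ill  = np.array([[2.0, 0.0], [0.0, 40.0]])   # anisotropic -> elongated ellipses

for name, H in [("well-conditioned", H_well), ("ill-conditioned", H_ill)]:
    eigvals = np.linalg.eigvalsh(H)            # H is symmetric
    print(name, "eigenvalue ratio =", eigvals.max() / eigvals.min())
# well-conditioned eigenvalue ratio = 1.0
# ill-conditioned eigenvalue ratio = 20.0
```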

WHAT ABOUT A CONSTANT STEP SIZE?

[Figure: α small enough: convergence (good); α too large: divergence]

WHAT ABOUT A CONSTANT STEP SIZE?

[Figure: α too low: slow convergence; α too large: divergence]

Adapting the step size may be necessary to avoid either too slow progress or overshooting the target minimum.

EXAMPLE: SUPERVISED MACHINE LEARNING

m labeled training samples (examples); y^(i) = known correct value for sample i; h_θ(x^(i)) = linear hypothesis function; θ = vector of parameters.

Loss function, sum of squared errors:

L(θ) = Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) )²

Goal: find the value of the parameter vector θ such that the loss (errors in classification / regression) is minimized over the training set. Any analogy with PSO?

Gradient: with the averaging factor 1/2m, ∂L/∂θ_j = (1/m) Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) ) x_j^(i)

If the averaging factor 1/2m is used, then the gradient-descent update for each parameter θ_j becomes:

θ_j ← θ_j − (α/m) Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) ) x_j^(i)
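A minimal sketch of this batch gradient-descent update on synthetic data; the data, learning rate α, and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data: y = 2 + 3x + noise, with a bias column added
m = 100
x = rng.uniform(-1, 1, size=m)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(m)
X = np.column_stack([np.ones(m), x])          # h_theta(x) = theta^T [1, x]

theta = np.zeros(2)
alpha = 0.1                                    # learning rate (step size)
for _ in range(2000):
    residuals = X @ theta - y                  # h_theta(x^(i)) - y^(i)
    grad = (X.T @ residuals) / m               # gradient of the 1/(2m)-averaged loss
    theta -= alpha * grad                      # batch gradient-descent update

print(theta)   # ~ [2, 3]
```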

STOCHASTIC GRADIENT DESCENT
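The body of this slide was not transcribed; as a hedged sketch, stochastic gradient descent replaces the full sum over the m samples with the gradient of a single randomly chosen sample per update, trading noisier steps for a much cheaper cost per step. A minimal variant of the previous example:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(-1, 1, size=m)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(m)
X = np.column_stack([np.ones(m), x])

theta = np.zeros(2)
alpha = 0.05
for epoch in range(200):
    for i in rng.permutation(m):               # one randomly ordered pass per epoch
        grad_i = (X[i] @ theta - y[i]) * X[i]  # gradient from a single sample
        theta -= alpha * grad_i                # noisy but cheap update

print(theta)   # ~ [2, 3]
```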

ADAPTIVE STEP SIZE?

The problem min_α f(x_k + α d_k) can be tackled using any algorithm for one-dimensional search: Fibonacci, golden section, Newton's method, secant method, ...

If f ∈ C², it is possible to compute the Hessian H_k, and α_k can be determined analytically. Let y_k = α d_k be the (feasible) small displacement applied in x_k, such that, from the Taylor series about x_k truncated at the 2nd order (quadratic approximation):

f(x_k + y_k) ≈ f(x_k) + y_k^T ∇_k + (1/2) y_k^T H_k y_k      (with ∇_k = ∇f(x_k))

In the algorithm, d_k is the direction of steepest descent, y_k = −α ∇_k, such that:

f(x_k − α ∇_k) ≈ f(x_k) − α ∇_k^T ∇_k + (1/2) α² ∇_k^T H_k ∇_k

We are looking to minimize f ⟹ the first-order condition df/dα = 0 must be satisfied:

d/dα f(x_k − α ∇_k) ≈ −∇_k^T ∇_k + α ∇_k^T H_k ∇_k = 0   ⟹   α_k = (∇_k^T ∇_k) / (∇_k^T H_k ∇_k)

The updating rule of the gradient descent algorithm becomes:

x_{k+1} ← x_k − [ (∇_k^T ∇_k) / (∇_k^T H_k ∇_k) ] ∇_k
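A minimal sketch of gradient descent with this Hessian-based step size on a quadratic, where the rule is exact; the matrix H, vector b, and starting point are illustrative:

```python
import numpy as np

# Quadratic test function f(x) = 0.5 * x^T H x - b^T x (illustrative H, b)
H = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: H @ x - b          # gradient of the quadratic
x = np.array([4.0, 4.0])

for k in range(20):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    alpha = (g @ g) / (g @ H @ g)   # alpha_k = (grad^T grad) / (grad^T H grad)
    x = x - alpha * g               # x_{k+1} = x_k - alpha_k * grad_k

print(x, np.linalg.solve(H, b))     # gradient-descent result vs exact minimizer
```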

CONDITIONS ON THE HESSIAN MATRIX

x_{k+1} ← x_k − [ (∇_k^T ∇_k) / (∇_k^T H_k ∇_k) ] ∇_k

The value found for α_k is a sound estimate of the correct value of α as long as the quadratic approximation of the Taylor series is an accurate approximation of f(x). At the beginning of the iterations ||y_k|| will be quite large, so the approximation will be inaccurate; getting closer to the minimum, ||y_k|| will decrease accordingly, and the accuracy will keep increasing.

If f is a quadratic function, the Taylor series is exact, and we can use = instead of ≈ ⟹ α_k is exact at each iteration.

The Hessian matrix is the matrix of 2nd-order partial derivatives. E.g., for f : X ⊆ R³ → R, the Hessian matrix computed in a point x is:

H_x = [ ∂²f/∂x_1²       ∂²f/∂x_1∂x_2    ∂²f/∂x_1∂x_3 ]
      [ ∂²f/∂x_2∂x_1    ∂²f/∂x_2²       ∂²f/∂x_2∂x_3 ]   evaluated at x
      [ ∂²f/∂x_3∂x_1    ∂²f/∂x_3∂x_2    ∂²f/∂x_3²    ]

PROPERTIES OF THE HESSIAN MATRIX

Theorem (Schwarz): If a function is C² (twice differentiable with continuity) in R^n ⟹ the order of derivation is irrelevant ⟹ the Hessian matrix is symmetric.

Theorem (Quadratic form of a matrix): Given a square, symmetric matrix H ∈ R^{n×n}, the associated quadratic form is defined as the function q(x) = (1/2) x^T H x. The matrix is said to be:
Positive definite if x^T H x > 0, ∀x ∈ R^n, x ≠ 0
Positive semi-definite if x^T H x ≥ 0, ∀x ∈ R^n
Negative definite if x^T H x < 0, ∀x ∈ R^n, x ≠ 0
Negative semi-definite if x^T H x ≤ 0, ∀x ∈ R^n
Indefinite if x^T H x > 0 for some x and x^T H x < 0 for other x

[Figure: quadratic forms: positive definite (convex), positive semi-definite, indefinite (saddle), negative definite (concave)]

HESSIAN QUADRATIC FORM AND MIN/MAX

Theorem (Use of the eigenvalues): A matrix H ∈ R^{n×n} is positive definite (semi-definite) if and only if all its eigenvalues are positive (non-negative).

Proof sketch: For a diagonal matrix D = diag(d_1, ..., d_n) where each diagonal entry is positive, the theorem holds, since x^T D x = Σ d_i x_i² > 0 (unless x = 0). Since H is real and symmetric (Hermitian), the spectral theorem says that it is diagonalizable using the eigenvector matrix E as an orthonormal basis for the transformation: E⁻¹ H E = Λ, where the elements of the diagonal matrix Λ are the eigenvalues of H. Therefore, H is positive (negative) definite if all its eigenvalues are positive (negative).

Theorem (Hessian and min/max points): Given x*, a stationary point of f : X ⊆ R^n → R (i.e., a point where ∇f(x*) = 0), and given f's Hessian matrix H(f) evaluated in x*, the following conditions are sufficient to determine the nature of x* as an extremum of the function:
1. If H is positive definite ⟹ x* is a minimum (local or global)
2. If H is negative definite ⟹ x* is a maximum (local or global)
3. If H has eigenvalues of opposite sign ⟹ x* is (in general) a saddle point
If H is semi-definite, positive or negative, the case is more complex to analyze...
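A small sketch of the eigenvalue test for classifying a stationary point; the helper name and the example Hessian (for f(x, y) = x² − y² at the origin) are illustrative:

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from the eigenvalues of the Hessian H."""
    eigvals = np.linalg.eigvalsh(H)          # H symmetric -> real eigenvalues
    if np.all(eigvals > tol):
        return "minimum (H positive definite)"
    if np.all(eigvals < -tol):
        return "maximum (H negative definite)"
    if eigvals.min() < -tol and eigvals.max() > tol:
        return "saddle point (eigenvalues of opposite sign)"
    return "semi-definite case: inconclusive from H alone"

# f(x, y) = x^2 - y^2 has a stationary point at the origin with Hessian:
H = np.array([[2.0, 0.0], [0.0, -2.0]])
print(classify_stationary_point(H))          # saddle point
```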

CONVERGENCE SPEED OF GRADIENT DESCENT

The previous theorem says that in the neighborhood of a local minimum any C² function is approximated by a positive definite quadratic form: the function is locally strictly convex. The Hessian matrix is symmetric, so the quadratic form is also symmetric.

In 2D, the isocurves are ellipses:
The axes are oriented as the eigenvectors of H
The length of each semi-axis is proportional to 1/√λ_i

Convergence speed depends on the shape of the ellipses:
~ Spherical: fast convergence
Unbalanced (elongated): slow convergence

Convergence rate: p = 1 (linear). For a quadratic function, the error f(x_k) − f(x*) is reduced at each iteration by a factor of at most ( (λ_max − λ_min) / (λ_max + λ_min) )².
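To see the effect numerically, the sketch below runs steepest descent with the exact quadratic step size on a near-spherical and on an elongated quadratic (illustrative diagonal Hessians) and counts the iterations needed to reach the same tolerance:

```python
import numpy as np

def steepest_descent_quadratic(H, x0, tol=1e-8, max_iter=100_000):
    """Steepest descent on f(x) = 0.5 x^T H x with the exact step size."""
    x = np.array(x0, dtype=float)
    for k in range(max_iter):
        g = H @ x
        if np.linalg.norm(g) < tol:
            return k
        x -= ((g @ g) / (g @ H @ g)) * g
    return max_iter

x0 = [1.0, 1.0]
print("well-conditioned:", steepest_descent_quadratic(np.diag([1.0, 2.0]), x0))
print("ill-conditioned: ", steepest_descent_quadratic(np.diag([1.0, 50.0]), x0))
# The elongated quadratic needs far more iterations to reach the same tolerance.
```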

NUMERIC EXAMPLE

H = [ 2  1 ]
    [ 1  2 ]

From the characteristic equation det(H − λI) = 0 we get λ² − 4λ + 3 = 0 ⟹ λ_1 = 1, λ_2 = 3.

For the eigenvectors, Hx = λx:
λ_1 = 1: Hx = x ⟹ (H − I)x = 0, which reduces to the equation x_2 = −x_1 ⟹ all of λ_1's eigenvectors are multiples of (1, −1)
λ_2 = 3: Hx = 3x ⟹ (H − 3I)x = 0, which reduces to the equation x_2 = x_1 ⟹ all of λ_2's eigenvectors are multiples of (1, 1)

There are two directions defined by the eigenvectors: x_2 = −x_1 and x_2 = x_1.

Isolines of the quadratic form: x^T H x = k ⟹ 2x_1² + 2x_1x_2 + 2x_2² = k.
We can transform it into canonical form in the eigenvector basis ⟹ λ_1 X² + λ_2 Y² = k, which is an ellipse oriented along the directions of the eigenvectors, with semi-axes of length √(k/λ_1) and √(k/λ_2).
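The computation above can be double-checked with a few lines of numpy:

```python
import numpy as np

H = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(H)   # eigh: symmetric matrix -> sorted eigenvalues
print(eigvals)                          # [1. 3.]
print(eigvecs)                          # columns ~ (1, -1)/sqrt(2) and (1, 1)/sqrt(2), up to sign
```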

STEP-SIZE WITHOUT HESSIAN MATRIX

Let's assume we have a local estimate α̂ of the value α_k, and let's denote f(x_k) as f_k and f(x_k − α̂ ∇_k) as f̂_k. From the previous 2nd-order truncated Taylor series:

f̂_k = f(x_k − α̂ ∇_k) ≈ f_k − α̂ ∇_k^T ∇_k + (1/2) α̂² ∇_k^T H_k ∇_k   ⟹   ∇_k^T H_k ∇_k ≈ 2( f̂_k − f_k + α̂ ∇_k^T ∇_k ) / α̂²

From the previous result, α_k = (∇_k^T ∇_k) / (∇_k^T H_k ∇_k), we can substitute this estimate:

α_k ≈ ( α̂² ∇_k^T ∇_k ) / ( 2( f̂_k − f_k + α̂ ∇_k^T ∇_k ) )

x_{k+1} ← x_k − [ ( α̂² ∇_k^T ∇_k ) / ( 2( f̂_k − f_k + α̂ ∇_k^T ∇_k ) ) ] ∇_k

A reasonable choice for α̂ is α_{k−1}, the (optimal) value taken by α in the previous iteration, starting with α̂_0 = 1.

Also in this case, the approximation is more or less good depending on how accurately the 2nd-order Taylor series approximates f(x).

Moral: isn't it easier (more effective / general) to use and play with PSO?
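A minimal sketch of this Hessian-free step-size rule on an illustrative quadratic; the safeguard on the denominator is an added practical detail, not from the slide:

```python
import numpy as np

def gd_hessian_free_step(f, grad_f, x0, eps=1e-8, max_iter=1000):
    """Gradient descent where alpha_k is estimated from one extra function
    evaluation at x_k - alpha_hat * grad_k, instead of computing the Hessian."""
    x = np.asarray(x0, dtype=float)
    alpha_hat = 1.0                              # alpha_hat_0 = 1
    for _ in range(max_iter):
        g = grad_f(x)
        gg = g @ g
        if np.sqrt(gg) < eps:
            break
        f_k = f(x)
        f_hat = f(x - alpha_hat * g)             # probe point with the trial step
        denom = 2.0 * (f_hat - f_k + alpha_hat * gg)
        alpha = alpha_hat**2 * gg / denom if denom > 0 else 0.5 * alpha_hat
        x = x - alpha * g
        alpha_hat = alpha                        # reuse alpha_{k-1} as the next estimate
    return x

# Quadratic test: f(x) = 0.5 x^T H x, minimum at the origin
H = np.array([[2.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ H @ x
grad_f = lambda x: H @ x
print(gd_hessian_free_step(f, grad_f, [3.0, -1.0]))   # ~ [0, 0]
```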