Computational Intelligence Winter Term 2017/18

Computational Intelligence Winter Term 2017/18 Prof. Dr. Günter Rudolph Lehrstuhl für Algorithm Engineering (LS 11) Fakultät für Informatik TU Dortmund

mutation: Y = X + Z with Z ~ N(0, C)

multinormal distribution: the maximum entropy distribution with support R^n, given expectation vector and covariance matrix

How should we choose the covariance matrix C? Unless we have learned something about the problem during the search, don't prefer any direction! ⇒ covariance matrix C = I_n (unit matrix)

[figure: sample isolines of the mutation distribution for C = I_n, C = diag(s_1, ..., s_n), and a general C with rotated (orthogonal) principal axes] 2
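A minimal sketch (not part of the original slides; assumes NumPy, and the concrete matrices are illustrative) of how the three covariance choices shape the mutation Z ~ N(0, C):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2

# three covariance choices from the slide
C_identity = np.eye(n)                        # C = I_n: no preferred direction
C_diagonal = np.diag([4.0, 0.25])             # C = diag(s_1, ..., s_n): axis-parallel scaling
R = np.array([[np.cos(0.5), -np.sin(0.5)],    # orthogonal rotation ...
              [np.sin(0.5),  np.cos(0.5)]])
C_rotated = R @ C_diagonal @ R.T              # ... gives a general (rotated) covariance

x = np.zeros(n)                               # current parent X
for C in (C_identity, C_diagonal, C_rotated):
    Z = rng.multivariate_normal(mean=np.zeros(n), cov=C, size=5)
    Y = x + Z                                 # mutation: Y = X + Z with Z ~ N(0, C)
    print(Y)
```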

claim: mutations should be aligned to the isolines of the problem (Schwefel 1981); if true, then the covariance matrix should be the inverse of the Hessian matrix!

assume f(x) = ½ x^T A x + b^T x + c  ⇒  H = A

Z ~ N(0, C) with density f_Z(z) = (2π)^{-n/2} det(C)^{-1/2} exp(-½ z^T C^{-1} z); since the isolines of this density are {z : z^T C^{-1} z = const} and the isolines of f are determined by {x : x^T A x = const}, the two are aligned if C^{-1} = A, i.e. C = A^{-1} = H^{-1}

⇒ many proposals for how to adapt the covariance matrix

extreme case: use n+1 pairs (x, f(x)), apply multiple linear regression to obtain estimators for A, b, c, then invert the estimated matrix A! OK, but: O(n^6)! (Rudolph 1992) 3
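A minimal sketch of the regression idea (not the original Rudolph 1992 method; the objective, sample size and helper names are my own). It also shows why the approach is expensive: a full quadratic model in n variables has (n^2 + 3n + 2)/2 free coefficients, so the least-squares system grows quickly with n.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3

# ground-truth quadratic f(x) = 1/2 x^T A x + b^T x + c (unknown to the estimator)
A_true = np.array([[4.0, 1.0, 0.0],
                   [1.0, 3.0, 0.5],
                   [0.0, 0.5, 2.0]])
b_true = np.array([1.0, -2.0, 0.5])
c_true = 0.3
f = lambda x: 0.5 * x @ A_true @ x + b_true @ x + c_true

m = (n * n + 3 * n + 2) // 2               # number of model coefficients
X = rng.standard_normal((2 * m, n))        # sampled points
y = np.array([f(x) for x in X])            # pairs (x, f(x))

def features(x):                           # monomials x_i x_j (i <= j), x_i, 1
    quad = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate([quad, x, [1.0]])

Phi = np.array([features(x) for x in X])
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# rebuild the symmetric matrix A from the quadratic coefficients
A_est = np.zeros((n, n))
k = 0
for i in range(n):
    for j in range(i, n):
        if i == j:
            A_est[i, i] = 2.0 * coef[k]            # coefficient of x_i^2 is A_ii / 2
        else:
            A_est[i, j] = A_est[j, i] = coef[k]    # coefficient of x_i x_j is A_ij
        k += 1

C = np.linalg.inv(A_est)     # proposed covariance: inverse of the estimated Hessian
print(np.allclose(A_est, A_true, atol=1e-8))
```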

doubts: are mutations aligned to the isolines really optimal? the principal axis should point into the negative gradient direction! (proof next slide)

most (effective) algorithms behave like this: they run roughly in the negative gradient direction; sooner or later they approach the longest main principal axis of the Hessian; at that point the negative gradient direction coincides with the direction to the optimum, which is parallel to the longest main principal axis of the Hessian, which is parallel to the longest main principal axis of the inverse covariance matrix (so Schwefel's alignment is OK in this situation) 4

Z = r Q u (a random length r times a random unit vector u, transformed by the matrix Q), with A = B^T B and B = Q^{-1}, i.e. C = Q Q^T = A^{-1}

if Qu were deterministic... set Qu = -∇f(x) (direction of steepest descent) 5

Apart from (inefficient) regression, how can we get the matrix elements of Q?

iteratively: C^(k+1) = update( C^(k), Population^(k) )

basic constraint: C^(k) must be symmetric and positive definite (p.d.) for all k ≥ 0, otherwise the Cholesky decomposition C = Q Q^T is impossible.

Lemma
Let A and B be square matrices and α, β > 0.
a) A, B symmetric ⇒ α A + β B symmetric.
b) A positive definite and B positive semidefinite ⇒ α A + β B positive definite.

Proof:
ad a) C = α A + β B is symmetric, since c_ij = α a_ij + β b_ij = α a_ji + β b_ji = c_ji.
ad b) for all x ∈ R^n \ {0}: x^T (α A + β B) x = α x^T A x + β x^T B x > 0, since α x^T A x > 0 and β x^T B x ≥ 0.  ∎ 6
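A quick numerical check of the lemma (illustrative, not part of the lecture; assumes NumPy). Cholesky succeeds exactly when the matrix is symmetric positive definite, which is why it is used as the test here:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)     # symmetric positive definite
v = rng.standard_normal((n, 1))
B = v @ v.T                     # symmetric positive semidefinite (rank one)

alpha, beta = 0.7, 0.3
C = alpha * A + beta * B

print(np.allclose(C, C.T))      # part a): the combination is symmetric
np.linalg.cholesky(C)           # part b): raises LinAlgError if C were not p.d.
```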

Theorem
A square matrix C^(k) is symmetric and positive definite for all k ≥ 0, if it is built via the iterative formula

C^(k+1) = α_k C^(k) + β_k v_k v_k^T

where C^(0) = I_n, v_k ≠ 0, α_k > 0 and liminf β_k > 0.

Proof:
If v ≠ 0, then the matrix V = v v^T is symmetric and positive semidefinite, since by definition of the dyadic product v_ij = v_i v_j = v_j v_i = v_ji for all i, j, and for all x ∈ R^n: x^T (v v^T) x = (x^T v)(v^T x) = (x^T v)^2 ≥ 0.
Thus, the sequence of matrices v_k v_k^T is symmetric and p.s.d. for all k ≥ 0. Owing to the previous lemma, the matrix C^(k+1) is symmetric and p.d. if C^(k) is symmetric and p.d. and the matrix v_k v_k^T is symmetric and p.s.d. Since C^(0) = I_n is symmetric and p.d., it follows that C^(1) is symmetric and p.d. Repetition of this argument leads to the statement of the theorem.  ∎ 7
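An illustrative sketch of the theorem's iteration (names and parameter values are my own, not from the lecture): rank-one updates with α_k > 0 and β_k bounded away from zero keep C^(k) symmetric and positive definite.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
C = np.eye(n)                      # C^(0) = I_n

for k in range(100):
    v = rng.standard_normal(n)     # any v_k != 0
    alpha_k, beta_k = 0.9, 0.1     # alpha_k > 0, liminf beta_k > 0
    C = alpha_k * C + beta_k * np.outer(v, v)
    np.linalg.cholesky(C)          # raises LinAlgError if C ever lost positive definiteness

print("all updates kept C symmetric p.d.:", np.allclose(C, C.T))
```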

CMA-ES

Idea: Don't estimate the matrix C anew in each iteration! Instead, approximate it iteratively!

Covariance Matrix Adaptation Evolutionary Algorithm (CMA-EA) (Hansen, Ostermeier et al. 1996ff.)

Set the initial covariance matrix to C^(0) = I_n.

C^(t+1) = (1 - η) C^(t) + η Σ_{i=1}^{µ} w_i (x_{i:λ} - m^(t)) (x_{i:λ} - m^(t))^T

η : learning rate ∈ (0, 1)
w_i : weights; mostly 1/µ
m : mean of all selected parents
sorting: f(x_{1:λ}) ≤ f(x_{2:λ}) ≤ ... ≤ f(x_{λ:λ})
complexity: O(µ n^2 + n^3)

Caution: must use the mean m^(t) of the old selected parents, not the new mean m^(t+1)! We are seeking the covariance matrix of a fictitious distribution pointing in the gradient direction! 8
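A minimal sketch of this rank-µ update (assuming NumPy; λ, µ, η and the test function are illustrative, and step-size control and evolution paths of the full CMA-ES are omitted):

```python
import numpy as np

def sphere(x):                          # illustrative objective
    return float(x @ x)

rng = np.random.default_rng(4)
n, lam, mu = 5, 12, 6                   # dimension, offspring lambda, parents mu
eta = 0.2                               # learning rate in (0, 1)
w = np.full(mu, 1.0 / mu)               # weights, here simply 1/mu

m = rng.standard_normal(n)              # current mean m^(t)
C = np.eye(n)                           # C^(0) = I_n

for t in range(50):
    X = rng.multivariate_normal(m, C, size=lam)     # lambda offspring ~ N(m^(t), C^(t))
    order = np.argsort([sphere(x) for x in X])      # f(x_{1:lam}) <= ... <= f(x_{lam:lam})
    parents = X[order[:mu]]                         # mu best offspring

    # rank-mu update, using the OLD mean m^(t)
    D = parents - m
    C = (1 - eta) * C + eta * sum(w[i] * np.outer(D[i], D[i]) for i in range(mu))
    m = w @ parents                                 # new mean m^(t+1)

print("f at the mean after 50 generations:", sphere(m))
```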

CMA-ES

State of the art: CMA-EA (currently many variants); many successful applications in practice.

available on the WWW:
http://www.lri.fr/~hansen/cmaes_inmatlab.html (C, C++, Java, Fortran, Python, Matlab, R, Scilab)
http://shark-project.sourceforge.net/ (EAlib, C++)

advice: before designing your own new method or grabbing another method with some fancy name... try CMA-ES: it is available in most software libraries and often does the job! 9
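For reference, a minimal usage sketch with Hansen's `cma` package from PyPI (the package and its API are not mentioned on the slide; assumes `pip install cma`):

```python
import cma  # Hansen's Python implementation of CMA-ES

def sphere(x):
    return sum(xi * xi for xi in x)

# start at x0 = [0.5, ..., 0.5] in 10 dimensions with initial step size 0.3
es = cma.CMAEvolutionStrategy(10 * [0.5], 0.3)
while not es.stop():
    solutions = es.ask()                                 # sample a population
    es.tell(solutions, [sphere(x) for x in solutions])   # pass back fitness values
print(es.result.xbest)                                   # best solution found
```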