Introduction to Optimization Konstantin Tretyakov (kt@ut.ee) MTAT.03.227 Machine Learning
So far
Machine learning is important and interesting.
The general concept: searching for the best-fitting model. In other words: optimization!
Today
1. Optimization is important
2. Optimization is possible*
* Basic techniques: constrained / unconstrained, analytic / iterative, continuous / discrete
Special cases of optimization
- Machine learning
- Algorithms and data structures
- General problem-solving
- Management and decision-making
- Evolution
- The Meaning of Life?
Optimization task
Given a function f: X → ℝ, find the argument x ∈ X resulting in the optimal (minimal or maximal) value f(x).
Constrained optimization task
Given a function f: X → ℝ, find the argument x resulting in the optimal value, subject to constraints on x (e.g. equalities g(x) = 0 or inequalities h(x) ≤ 0).
Optimization methods
In principle, x can be anything:
- Discrete: a value (e.g. a name) or a structure (e.g. a graph, a plaintext); finite or infinite
- Continuous*: a real number, vector, or matrix; a complex number, a function, ...
Optimization methods
In principle, f can be anything:
- a random oracle
- structured
- continuous
- differentiable
- convex
Optimization methods
Classified by the type of x (discrete or continuous) and by how much we know about f:
- Discrete x, not much knowledge of f: combinatorial search (brute-force, stepwise, MCMC, population-based, ...)
- Continuous x, not much knowledge of f: numeric methods (gradient-based, Newton-like, MCMC, population-based, ...)
- Discrete x, a lot of knowledge of f: algorithmic solutions
- Continuous x, a lot of knowledge of f: analytic solutions
Finding a weight vector β minimizing the model error is a continuous problem: in a very general case we must fall back on numeric methods, while in many practical cases we know enough about f to solve analytically. This lecture focuses on the continuous case.
Minima and maxima
Differentiability
The Most Important Observation
For a differentiable function, a small step Δx changes the value approximately linearly:
  f(x + Δx) ≈ f(x) + ∇f(x)ᵀ Δx
This small observation gives us everything we need for now:
- a nice interpretation of the gradient,
- an extremality criterion,
- an iterative algorithm for function minimization.
Interpretation of the gradient
The gradient ∇f(x) points in the direction of steepest ascent of f at x; the opposite direction, −∇f(x), is the direction of steepest descent.
Extremality criterion
At an extremum of a differentiable function, the gradient must vanish: ∇f(x) = 0 (Fermat's theorem).
Gradient descent
1. Pick a random point x₀.
2. If ∇f(x₀) = 0, then we've found an extremum.
3. Otherwise, make a small step downhill: x₁ ← x₀ − μ₀ ∇f(x₀)
4. ... and then another step: x₂ ← x₁ − μ₁ ∇f(x₁)
5. ... and so on, until ∇f(xₙ) ≈ 0 or we're tired.
With a smart choice of μᵢ we'll converge to a minimum.
Gradient descent
The general update rule:
  xᵢ₊₁ ← xᵢ − μᵢ ∇f(xᵢ),  i.e.  Δxᵢ = −μᵢ ∇f(xᵢ)
Writing c = ∇f(xᵢ), the step is Δxᵢ = −μᵢ c.

Gradient descent (fixed step)
With a constant step size μ:
  Δxᵢ = −μ ∇f(xᵢ)
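The fixed-step update above can be sketched in a few lines of Python (a minimal illustration; the quadratic test function, step size, and stopping tolerance are my own choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, mu=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent: x_{i+1} = x_i - mu * grad_f(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # gradient ~ 0: extremum reached
            break
        x = x - mu * g               # small step downhill
    return x

# Example: minimize f(x, y) = (x - 1)^2 + 2*(y + 3)^2
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 3)])
x_min = gradient_descent(grad, [0.0, 0.0])
print(x_min)  # close to [1, -3]
```

Note that the step size μ matters: too large and the iteration diverges, too small and convergence is slow.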
Example
Stochastic gradient descent
Whenever the function to be minimized is a sum over samples coming from some distribution,
  f(w) = Σₖ g(w, xₖ),
the gradient is also a sum:
  ∇f(w) = Σₖ ∇g(w, xₖ)
Stochastic gradient descent
The step of the gradient descent algorithm is then:
  Δwᵢ = −μ Σₖ ∇g(wᵢ, xₖ)
This is referred to as the batch update. It turns out that the minimization can also be performed by sampling a single random element from the sum on each step (the on-line update):
  Δwᵢ = −μ ∇g(wᵢ, x_random)
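The on-line update can be sketched as follows (a toy one-parameter least-squares example of my own; here g(w, xₖ) is the squared error on a single sample):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3*x + noise; minimize f(w) = sum_k (w*x_k - y_k)^2
X = rng.uniform(-1, 1, size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

def grad_g(w, k):
    """Gradient of the single-sample loss g(w, x_k) = (w*x_k - y_k)^2."""
    return 2 * (w * X[k] - y[k]) * X[k]

w = 0.0
mu = 0.1
for step in range(5000):
    k = rng.integers(len(X))   # sample one random element of the sum
    w = w - mu * grad_g(w, k)  # on-line update

print(w)  # close to the true slope 3.0
```

With a constant step size the iterate keeps fluctuating around the minimum; in practice μ is often decreased over time to make the iteration settle down.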
Second-order methods
First-order approximation (~ differentiation):
  f(x) ≈ f(xᵢ) + cᵀ Δx
Second-order approximation (~ double differentiation):
  f(x) ≈ f(xᵢ) + cᵀ Δx + ½ Δxᵀ H Δx,
where c = ∇f(xᵢ) is the gradient and H = ∇²f(xᵢ) is the Hessian.
Find the optimum of the approximation analytically:
  c + H Δx = 0  ⇒  Δx = −H⁻¹ c
Second-order methods
Gradient descent (fixed step):
  Δxᵢ = −μ ∇f(xᵢ)
Newton's method:
  Δxᵢ = −H(xᵢ)⁻¹ ∇f(xᵢ)
Quasi-Newton methods:
  Δxᵢ = −Rᵢ ∇f(xᵢ),
where Rᵢ is an iteratively computed approximation to the true inverse Hessian.
Convexity Even among differentiable functions, some are very unfriendly:
Convexity There is, however, a class of particularly nice convex functions:
Convexity
A function f is convex if, for all x, y and all α ∈ [0, 1],
  f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y).
Convexity
A strictly convex function has a unique minimum. Due to convexity, it is easy to find this minimum; you don't even need differentiability! Many practically useful functions (e.g. norms) are convex.
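For instance, a convex function of one variable can be minimized by ternary search, which uses only function evaluations and no derivatives (a small sketch; the example function is my own):

```python
def ternary_search(f, lo, hi, tol=1e-9):
    """Minimize a convex 1-D function on [lo, hi] without derivatives."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2   # by convexity, the minimum lies in [lo, m2]
        else:
            lo = m1   # by convexity, the minimum lies in [m1, hi]
    return (lo + hi) / 2

# |x - 2| is convex but not differentiable at its minimum
x_star = ternary_search(lambda x: abs(x - 2), -10, 10)
print(x_star)  # close to 2.0
```

Each iteration shrinks the search interval by a factor of 2/3, so convergence is geometric even though no gradient ever exists at the minimum.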
Summary
By now you should:
- Be capable of seeing the world as an optimization problem.
- Be prepared to apply optimization techniques in practice.
- Know: global/local minimum/maximum; convexity, differentiability, Fermat's theorem ;) gradient, gradient descent, stochastic gradient descent, batch vs. on-line updates, Hessian, Newton's method.
Summary 1. Optimization is important 2. Optimization is possible*
* The following material is not covered in the lecture, but is highly recommended for self-study.