Introduction to Optimization Konstantin Tretyakov (kt@ut.ee) MTAT.03.227 Machine Learning
So far
Machine learning is important and interesting.
The general concept: searching for the best-fitting model. In other words: optimization!
Today
1. Optimization is important
2. Optimization is possible*
* Basic techniques: constrained / unconstrained, analytic / iterative, continuous / discrete
Special cases of optimization
- Machine learning
- Algorithms and data structures
- General problem-solving
- Management and decision-making
- Evolution
- The Meaning of Life?
Optimization task
Given a function f: X → ℝ, find the argument x ∈ X resulting in the optimal (minimal or maximal) value f(x).
Constrained optimization task
Given a function f: X → ℝ, find the argument x resulting in the optimal value, subject to constraints on x (e.g. equalities g(x) = 0 or inequalities h(x) ≤ 0).
Optimization methods
In principle, x can be anything:
- Discrete: a value (e.g. a name) or a structure (e.g. a graph, a plaintext); finite or infinite
- Continuous*: a real number, vector, or matrix; a complex number, a function, ...
Optimization methods
In principle, f can be anything:
- a random oracle
- structured
- continuous
- differentiable
- convex
Optimization methods
Classified by the type of x (discrete or continuous) and by how much we know about f:
- Discrete x, not much knowledge of f: combinatorial search (brute-force, stepwise, MCMC, population-based, ...)
- Continuous x, not much knowledge of f: numeric methods (gradient-based, Newton-like, MCMC, population-based, ...)
- Discrete x, a lot of knowledge of f: algorithmic solutions
- Continuous x, a lot of knowledge of f: analytic solutions
Finding a weight vector β minimizing the model error is a continuous problem: in a very general case we must fall back on numeric methods, while in many practical cases we know enough about f to solve analytically. This lecture focuses on the continuous case.
Minima and maxima
Differentiability
The Most Important Observation
For a differentiable function, a small step Δx changes the value approximately linearly:
  f(x + Δx) ≈ f(x) + ∇f(x)ᵀ Δx
This small observation gives us everything we need for now:
- a nice interpretation of the gradient,
- an extremality criterion,
- an iterative algorithm for function minimization.
Interpretation of the gradient
The gradient ∇f(x) points in the direction of steepest ascent of f at x; the opposite direction, −∇f(x), is the direction of steepest descent.
Extremality criterion
At an extremum of a differentiable function, the gradient must vanish: ∇f(x) = 0 (Fermat's theorem).
Gradient descent
1. Pick a random point x₀.
2. If ∇f(x₀) = 0, then we've found an extremum.
3. Otherwise, make a small step downhill: x₁ ← x₀ − μ₀ ∇f(x₀)
4. ... and then another step: x₂ ← x₁ − μ₁ ∇f(x₁)
5. ... and so on, until ∇f(xₙ) ≈ 0 or we're tired.
With a smart choice of μᵢ we'll converge to a minimum.
Gradient descent
The general update rule:
  xᵢ₊₁ ← xᵢ − μᵢ ∇f(xᵢ),  i.e.  Δxᵢ = −μᵢ ∇f(xᵢ)
Writing c = ∇f(xᵢ), the step is Δxᵢ = −μᵢ c.

Gradient descent (fixed step)
With a constant step size μ:
  Δxᵢ = −μ ∇f(xᵢ)
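The fixed-step update above can be sketched in a few lines of Python (a minimal illustration; the quadratic test function, step size, and stopping tolerance are my own choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, mu=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent: x_{i+1} = x_i - mu * grad_f(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # gradient ~ 0: extremum reached
            break
        x = x - mu * g               # small step downhill
    return x

# Example: minimize f(x, y) = (x - 1)^2 + 2*(y + 3)^2
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 3)])
x_min = gradient_descent(grad, [0.0, 0.0])
print(x_min)  # close to [1, -3]
```

Note that the step size μ matters: too large and the iteration diverges, too small and convergence is slow.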
Example
Stochastic gradient descent
Whenever the function to be minimized is a sum over samples coming from some distribution,
  f(w) = Σₖ g(w, xₖ),
the gradient is also a sum:
  ∇f(w) = Σₖ ∇g(w, xₖ)
Stochastic gradient descent
The step of the gradient descent algorithm is then:
  Δwᵢ = −μ Σₖ ∇g(wᵢ, xₖ)
This is referred to as the batch update. It turns out that the minimization can also be performed by sampling a single random element from the sum on each step (the on-line update):
  Δwᵢ = −μ ∇g(wᵢ, x_random)
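The on-line update can be sketched as follows (a toy one-parameter least-squares example of my own; here g(w, xₖ) is the squared error on a single sample):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3*x + noise; minimize f(w) = sum_k (w*x_k - y_k)^2
X = rng.uniform(-1, 1, size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

def grad_g(w, k):
    """Gradient of the single-sample loss g(w, x_k) = (w*x_k - y_k)^2."""
    return 2 * (w * X[k] - y[k]) * X[k]

w = 0.0
mu = 0.1
for step in range(5000):
    k = rng.integers(len(X))   # sample one random element of the sum
    w = w - mu * grad_g(w, k)  # on-line update

print(w)  # close to the true slope 3.0
```

With a constant step size the iterate keeps fluctuating around the minimum; in practice μ is often decreased over time to make the iteration settle down.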
Second-order methods
First-order approximation (~ differentiation):
  f(x) ≈ f(xᵢ) + cᵀ Δx
Second-order approximation (~ double differentiation):
  f(x) ≈ f(xᵢ) + cᵀ Δx + ½ Δxᵀ H Δx,
where c = ∇f(xᵢ) is the gradient and H = ∇²f(xᵢ) is the Hessian.
Find the optimum of the approximation analytically:
  c + H Δx = 0  ⇒  Δx = −H⁻¹ c
Second-order methods
Gradient descent (fixed step):
  Δxᵢ = −μ ∇f(xᵢ)
Newton's method:
  Δxᵢ = −H(xᵢ)⁻¹ ∇f(xᵢ)
Quasi-Newton methods:
  Δxᵢ = −Rᵢ ∇f(xᵢ),
where Rᵢ is an iteratively computed approximation to the true inverse Hessian.
Convexity Even among differentiable functions, some are very unfriendly:
Convexity There is, however, a class of particularly nice convex functions:
Convexity
A function f is convex if, for all x, y and all α ∈ [0, 1],
  f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y).
Convexity
A strictly convex function has a unique minimum. Due to convexity, it is easy to find this minimum; you don't even need differentiability! Many practically useful functions (e.g. norms) are convex.
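For instance, a convex function of one variable can be minimized by ternary search, which uses only function evaluations and no derivatives (a small sketch; the example function is my own):

```python
def ternary_search(f, lo, hi, tol=1e-9):
    """Minimize a convex 1-D function on [lo, hi] without derivatives."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2   # by convexity, the minimum lies in [lo, m2]
        else:
            lo = m1   # by convexity, the minimum lies in [m1, hi]
    return (lo + hi) / 2

# |x - 2| is convex but not differentiable at its minimum
x_star = ternary_search(lambda x: abs(x - 2), -10, 10)
print(x_star)  # close to 2.0
```

Each iteration shrinks the search interval by a factor of 2/3, so convergence is geometric even though no gradient ever exists at the minimum.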
Summary
By now you should:
- Be capable of seeing the world as an optimization problem.
- Be prepared to apply optimization techniques in practice.
- Know: global/local minimum/maximum; convexity, differentiability, Fermat's theorem ;) gradient, gradient descent, stochastic gradient descent, batch vs. on-line updates, Hessian, Newton's method.
Summary 1. Optimization is important 2. Optimization is possible*
* The following material is not covered in the lecture, but is highly recommended for self-study.