Introduction to Optimization


Introduction to Optimization Konstantin Tretyakov (kt@ut.ee) MTAT.03.227 Machine Learning


So far Machine learning is important and interesting. The general concept: searching for the best fitting model. That search is optimization!


Today 1. Optimization is important. 2. Optimization is possible* (*basic techniques: constrained / unconstrained, analytic / iterative, continuous / discrete)


Special cases of optimization Machine learning. Algorithms and data structures. General problem-solving. Management and decision-making. Evolution. The Meaning of Life?

Optimization task Given a function f, find the argument x resulting in the optimal value f(x).

Constrained optimization task Given a function f, find the argument x resulting in the optimal value f(x), subject to given constraints on x.

Optimization methods In principle, x can be anything: discrete (a value, e.g. a name; a structure, e.g. a graph or plaintext; finite or infinite) or continuous* (a real number, vector, matrix, complex number, function, …).

Optimization methods In principle, f can be anything: a random oracle, structured, continuous, differentiable, convex, …

Optimization methods A rough map of techniques, by the type of x and by how much is known about f:
- x discrete, not much known about f: combinatorial search (brute-force, stepwise, MCMC, population-based, …)
- x continuous, not much known about f: numeric methods (gradient-based, Newton-like, MCMC, population-based, …)
- x discrete, a lot known about f: algorithmic solutions
- x continuous, a lot known about f: analytic solutions

Finding a weight vector β that minimizes the model error is, in many practical cases, a continuous problem with little exploitable structure in f, so it lands in the numeric-methods cell. That cell is the focus of this lecture.

Minima and maxima

Differentiability



The Most Important Observation This small observation gives us everything we need for now: a nice interpretation of the gradient, an extremality criterion, and an iterative algorithm for function minimization.

Interpretation of the gradient


Extremality criterion


Gradient descent 1. Pick a random point x₀. 2. If ∇f(x₀) = 0, then we've found an extremum. 3. Otherwise, make a small step downhill: x₁ ← x₀ − μ₀ ∇f(x₀) 4. then another step: x₂ ← x₁ − μ₁ ∇f(x₁) 5. and so on, until ∇f(xₙ) ≈ 0 or we're tired. With a smart choice of the step sizes μᵢ we'll converge to a minimum.


Gradient descent xᵢ₊₁ ← xᵢ − μᵢ ∇f(xᵢ)

Gradient descent, written as a step: Δxᵢ = −μᵢ ∇f(xᵢ)

Gradient descent (fixed step): Δxᵢ = −μ ∇f(xᵢ)

Example
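The example on this slide is a figure that did not survive transcription. As a stand-in, here is a minimal Python sketch of fixed-step gradient descent; the quadratic objective, starting point, and step size μ are illustrative choices, not from the slides:

```python
# Fixed-step gradient descent on f(x, y) = (x - 1)^2 + 2*(y + 3)^2,
# whose unique minimum is at (1, -3). All constants are illustrative.

def grad_f(x, y):
    # Gradient of f: (df/dx, df/dy)
    return (2 * (x - 1), 4 * (y + 3))

def gradient_descent(x0, y0, mu=0.1, steps=200):
    x, y = x0, y0
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        # Fixed-step update: x_{i+1} <- x_i - mu * grad f(x_i)
        x, y = x - mu * gx, y - mu * gy
    return x, y

x, y = gradient_descent(5.0, 5.0)
print(round(x, 3), round(y, 3))  # converges near the minimum (1, -3)
```

Too large a μ makes the iterates overshoot and diverge; too small a μ makes convergence needlessly slow, which is why adaptive step sizes μᵢ are used in practice.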

Stochastic gradient descent Whenever the function to be minimized is a sum over samples coming from some distribution, f(w) = Σₖ g(w, xₖ), the gradient is also a sum: ∇f(w) = Σₖ ∇g(w, xₖ)

Stochastic gradient descent The step of the gradient descent algorithm is then Δwᵢ = −μ Σₖ ∇g(wᵢ, xₖ). This is referred to as the batch update. It turns out the minimization can also be performed by sampling a single random element from the sum at each step (the on-line update): Δwᵢ = −μ ∇g(wᵢ, x_random)
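A minimal sketch of the on-line update, assuming a made-up objective f(w) = Σₖ (w − xₖ)², whose minimizer is the sample mean; the data, step size, and epoch count are illustrative:

```python
import random

# Illustrative setup: g(w, x_k) = (w - x_k)^2, so grad g = 2 * (w - x_k).
# f(w) = sum_k g(w, x_k) is minimized at the mean of the samples.
random.seed(0)
data = [random.gauss(4.0, 1.0) for _ in range(1000)]

def sgd(data, mu=0.01, epochs=5):
    w = 0.0
    for _ in range(epochs):
        for _ in range(len(data)):
            xk = random.choice(data)   # sample a single random element
            w -= mu * 2 * (w - xk)     # on-line update: w <- w - mu * grad g(w, x_k)
    return w

w = sgd(data)
print(w)  # hovers close to the sample mean (about 4 here)
```

Each on-line step is a thousand times cheaper than a batch step over this data, at the cost of noise in the trajectory: w keeps fluctuating around the minimum rather than settling exactly on it, unless μ is decayed over time.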

Second-order methods First-order approximation (~ differentiation): f(x) ≈ f(xᵢ) + cᵀΔx. Second-order approximation (~ double differentiation): f(x) ≈ f(xᵢ) + cᵀΔx + ½ ΔxᵀHΔx, where c = ∇f(xᵢ) is the gradient and H = ∇²f(xᵢ) is the Hessian. The optimum of the quadratic approximation can be found analytically: c + HΔx = 0, hence Δx = −H⁻¹c.


Second-order methods Gradient descent (fixed step): Δxᵢ = −μ ∇f(xᵢ). Newton's method: Δxᵢ = −H(xᵢ)⁻¹ ∇f(xᵢ). Quasi-Newton methods: Δxᵢ = −Rᵢ ∇f(xᵢ), where Rᵢ is an iteratively computed approximation to the true inverse Hessian.
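A minimal sketch of Newton's method in one dimension, where the Hessian reduces to the second derivative; the polynomial and starting point are illustrative choices, not from the slides:

```python
# Newton's method in 1-D: delta_x = -f''(x)^{-1} * f'(x).
# Illustrative function: f(x) = x^4 - 3x^2 + 2, with stationary points
# at x = 0 and x = ±sqrt(3/2).

def f_prime(x):
    return 4 * x**3 - 6 * x    # f'(x), the "gradient"

def f_double_prime(x):
    return 12 * x**2 - 6       # f''(x), the 1-D "Hessian"

def newton(x0, steps=20):
    x = x0
    for _ in range(steps):
        # Newton update: x <- x - H^{-1} * gradient
        x = x - f_prime(x) / f_double_prime(x)
    return x

x = newton(2.0)
print(round(x, 4))  # a stationary point: f'(x) = 0, here sqrt(1.5) ~ 1.2247
```

Note the trade-off the slides imply: each Newton step uses curvature information and converges in far fewer iterations than gradient descent, but computing (or approximating, as quasi-Newton methods do with Rᵢ) the inverse Hessian is the expensive part in higher dimensions.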

Convexity Even among differentiable functions, some are very unfriendly:

Convexity There is, however, a class of particularly nice convex functions:


Convexity A strictly convex function has a unique minimum. Due to convexity, this minimum is easy to find; you don't even need differentiability! Many practically useful functions (e.g. norms) are convex.
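The convexity figures on the slides are not reproduced in this transcript. As an illustration, the standard convexity inequality f(t·a + (1−t)·b) ≤ t·f(a) + (1−t)·f(b) can be spot-checked numerically for f(x) = |x|, a norm the slide names as convex; the sample points and mixing weights below are arbitrary:

```python
# Spot-check the convexity inequality
#   f(t*a + (1-t)*b) <= t*f(a) + (1-t)*f(b)
# on a grid of sample points. Passing the check does not prove convexity;
# failing it disproves convexity.

def is_convex_on_samples(f, points, ts):
    for a in points:
        for b in points:
            for t in ts:
                mid = f(t * a + (1 - t) * b)            # function value on the chord's x
                chord = t * f(a) + (1 - t) * f(b)       # chord value
                if mid > chord + 1e-12:                 # tolerance for float error
                    return False
    return True

points = [-5.0, -1.3, 0.0, 0.7, 4.2]
ts = [0.0, 0.25, 0.5, 0.75, 1.0]
print(is_convex_on_samples(abs, points, ts))  # True: |x| satisfies the inequality
```

The same checker returns False for a non-convex function such as lambda x: -x * x, since some chord there lies below the function.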

Summary By now you should: Be capable of seeing the world as an optimization problem. Be prepared to apply optimization techniques in practice. Know: global/local minimum/maximum; convexity, differentiability, Fermat's theorem ;); gradient, gradient descent, stochastic gradient descent, batch vs on-line updates, Hessian, Newton's method.

Summary 1. Optimization is important 2. Optimization is possible*

* The following material is not covered in the lecture, but is highly recommended for self-study.