Non-convex optimization. Issam Laradji



Strongly convex objective function (figure: plot of a strongly convex f(x) versus x).

Strongly convex objective function (figure: plot of f(x) versus x). Assumptions: gradient Lipschitz continuous; strongly convex.

Strongly convex objective function. Assumptions: gradient Lipschitz continuous; strongly convex. Randomized coordinate descent.
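
As a concrete illustration of the update (not taken from the slides), here is a minimal randomized coordinate descent sketch for a strongly convex quadratic, where the coordinate-wise Lipschitz constants are simply the diagonal entries of A:

import numpy as np

def randomized_coordinate_descent(A, b, iters=1000, seed=0):
    # Minimize f(x) = 0.5 * x^T A x - b^T x for symmetric positive definite A
    # by updating one randomly chosen coordinate per iteration.
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    L = np.diag(A)                 # coordinate-wise Lipschitz constants L_i = A_ii
    for _ in range(iters):
        i = rng.integers(n)        # pick a coordinate uniformly at random
        g_i = A[i] @ x - b[i]      # i-th partial derivative of f
        x[i] -= g_i / L[i]         # step of size 1/L_i along coordinate i
    return x

# Small strongly convex example: the iterate approaches A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(randomized_coordinate_descent(A, b))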

Non-strongly convex optimization. Assumptions: gradient Lipschitz continuous. Convergence rate, compared with the strongly convex convergence rate.
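
For gradient descent with step size 1/L on a function whose gradient is L-Lipschitz, the standard form of this comparison (the constants on the original slide may differ) is:

f(x_k) - f^\ast \le \frac{L \, \|x_0 - x^\ast\|^2}{2k}    (convex, not strongly convex: sublinear O(1/k) rate)

f(x_k) - f^\ast \le \left(1 - \tfrac{\mu}{L}\right)^k \bigl(f(x_0) - f^\ast\bigr)    (\mu-strongly convex: linear rate)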

Non-strongly Convex optimization

Non-strongly convex objective function. Assumptions: gradient Lipschitz continuous; restricted secant inequality. Randomized coordinate descent.

Invex functions (a generalization of convex functions). Assumptions on the objective function: gradient Lipschitz continuous; the Polyak [1963] condition. An invex function has one global minimum. This inequality simply requires that the gradient grows faster than a linear function as we move away from the optimal function value.

Invex functions (a generalization of convex functions). Assumptions on the objective function: gradient Lipschitz continuous; the Polyak [1963] condition (for invex functions where this holds). An invex function has one global minimum. Randomized coordinate descent.

Invex functions (a generalization of convex functions). Assumptions on the objective function: gradient Lipschitz continuous; the Polyak [1963] condition (for invex functions where this holds). Venn diagram relating the invex, convex, and Polyak function classes. Randomized coordinate descent.
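
The Polyak [1963] condition referenced above is commonly written as the Polyak-Łojasiewicz (PL) inequality: for some \mu > 0 and optimal value f^\ast,

\tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu \bigl(f(x) - f^\ast\bigr) \quad \text{for all } x.

With an L-Lipschitz gradient this yields a linear rate, f(x_k) - f^\ast \le (1 - \mu/L)^k (f(x_0) - f^\ast), for gradient descent with step size 1/L, and an analogous rate for randomized coordinate descent with coordinate-wise Lipschitz constants, without requiring convexity.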

Non-convex functions

Non-convex functions (figure: a non-convex function with local maxima, local minima, and a global minimum).

Non-convex functions (figure: global minimum and local minima). Strategy 1: local optimization of the non-convex function.

Non-convex functions (figure: local maxima, local minima, global minimum). Assumptions for local non-convex optimization: Lipschitz continuous; locally convex.

Non-convex functions (figure: local maxima, local minima, global minimum). Strategy 1: local optimization of the non-convex function, e.g. local randomized coordinate descent; all convex-function rates apply.

Non-convex functions (figure: local maxima, local minima, global minimum). Strategy 1: local optimization of the non-convex function. Issue: dealing with saddle points.

Non-convex functions (figure: global minimum and local minima). Strategy 2: global optimization of the non-convex function. Issue: exponential number of saddle points.

Local non-convex optimization. Gradient descent: difficult to define a proper step size. Newton's method: solves the slowness problem by rescaling the gradient in each direction with the inverse of the corresponding eigenvalue of the Hessian, but this can result in moving in the wrong direction (negative eigenvalues). Saddle-Free Newton's method: rescales the gradient by the inverse of |H|, the Hessian with its eigenvalues replaced by their absolute values, approximated using the Hessian's Lanczos vectors.
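
A minimal dense-matrix sketch of the saddle-free rescaling (the full saddle-free Newton method works in a low-dimensional Krylov subspace built from the Lanczos vectors mentioned above; the damping constant here is an illustrative choice):

import numpy as np

def saddle_free_newton_step(grad, hess, damping=1e-3):
    # Rescale the gradient by |H|^{-1}, where |H| replaces every Hessian
    # eigenvalue by its absolute value (dense eigendecomposition for clarity).
    eigvals, eigvecs = np.linalg.eigh(hess)
    abs_vals = np.abs(eigvals) + damping          # damped |eigenvalues|
    return eigvecs @ ((eigvecs.T @ grad) / abs_vals)

# At the saddle point of f(x, y) = x^2 - y^2 the Hessian has a negative
# eigenvalue: a plain Newton step moves toward the saddle, while the
# saddle-free step moves away from it along the negative-curvature direction.
x = np.array([0.1, 0.1])
grad = np.array([2 * x[0], -2 * x[1]])
hess = np.array([[2.0, 0.0], [0.0, -2.0]])
print(x - saddle_free_newton_step(grad, hess))    # y moves away from 0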

Local non-convex optimization. Random stochastic gradient descent: sample noise r uniformly from the unit sphere; it escapes saddle points, but the step size is difficult to determine. Cubic regularization [Nesterov 2006]: assumes the gradient and the Hessian are Lipschitz continuous.

Local non-convex optimization. Random stochastic gradient descent: sample noise r uniformly from the unit sphere; it escapes saddle points, but the step size is difficult to determine. Momentum can also help escape saddle points (the rolling-ball intuition).
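
A minimal sketch of the noise-injection idea on a toy function with a strict saddle (the test function, step size, and noise radius are illustrative choices, not values from the slides):

import numpy as np

def perturbed_gradient_descent(grad, x0, step=0.05, noise_radius=0.1,
                               iters=300, seed=0):
    # Gradient descent with noise drawn uniformly from a sphere of radius
    # noise_radius, which lets the iterate fall off strict saddle points.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        r = rng.normal(size=x.shape)
        r *= noise_radius / np.linalg.norm(r)     # uniform direction on the sphere
        x -= step * (grad(x) + r)
    return x

# f(x, y) = x^2 + y^4 - y^2 has a strict saddle at the origin and two minima
# near (0, +-0.707). Started exactly at the saddle, plain gradient descent
# would stay put; the noisy version escapes and settles near a minimum.
grad = lambda z: np.array([2 * z[0], 4 * z[1] ** 3 - 2 * z[1]])
print(perturbed_gradient_descent(grad, [0.0, 0.0]))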

Global non-convex optimization. The matrix completion problem [De Sa et al. 2015]. Applications (usually for datasets with missing data): image reconstruction, recommendation systems.

Global non-convex optimization. Matrix completion problem [De Sa et al. 2015]: reformulate it as an unconstrained non-convex problem and apply gradient descent.

Global non-convex optimization. Reformulate it as an unconstrained non-convex problem and apply gradient descent. Using the Riemannian manifold, we can derive the following: this widely used algorithm converges globally, using only random initialization.
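
A minimal sketch of the unconstrained reformulation, assuming a rank-r factorization M ≈ U Vᵀ and a squared-error loss over the observed entries (an illustrative setup, not the exact formulation or step sizes analyzed by De Sa et al.):

import numpy as np

def matrix_completion(M_obs, mask, rank=1, step=0.02, iters=5000, seed=0):
    # Gradient descent on 0.5 * || mask * (U V^T - M_obs) ||_F^2 over the
    # factors U, V; mask[i, j] = 1 where M_obs[i, j] is observed, else 0.
    rng = np.random.default_rng(seed)
    m, n = M_obs.shape
    U = 0.1 * rng.standard_normal((m, rank))      # random initialization
    V = 0.1 * rng.standard_normal((n, rank))
    for _ in range(iters):
        R = mask * (U @ V.T - M_obs)              # residual on observed entries
        U, V = U - step * (R @ V), V - step * (R.T @ U)
    return U @ V.T

# Toy example: a rank-1 matrix with roughly 30% of its entries hidden.
rng = np.random.default_rng(1)
M = np.outer(rng.standard_normal(6), rng.standard_normal(5))
mask = (rng.random(M.shape) < 0.7).astype(float)
M_hat = matrix_completion(M * mask, mask, rank=1)
print(np.abs(M_hat - M).max())    # reconstruction error over all entries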

Convex relaxation of non-convex optimization problems. Convex Neural Networks [Bengio et al. 2006]: a single-hidden-layer network; the original neural network training problem is non-convex.

Convex relaxation of non-convex optimization problems. Convex Neural Networks [Bengio et al. 2006]: single-hidden-layer network; use alternating minimization. Potential issues with the activation function.
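
A generic alternating-minimization sketch for a single-hidden-layer regression network: the output weights form a convex least-squares subproblem given fixed hidden weights, and the hidden weights are then updated by gradient steps. This only illustrates the alternation idea; it is not the convex formulation of Bengio et al., and the data, widths, and learning rate are illustrative assumptions:

import numpy as np

def fit_one_hidden_layer(X, y, hidden=20, outer_iters=50, lr=0.1, seed=0):
    # Alternate between (a) the convex least-squares fit of the output weights
    # given fixed hidden weights and (b) a gradient step on the hidden weights
    # given fixed output weights, for the loss 0.5 * ||tanh(X W) w - y||^2.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, hidden))
    for _ in range(outer_iters):
        H = np.tanh(X @ W)                            # hidden activations
        w, *_ = np.linalg.lstsq(H, y, rcond=None)     # convex subproblem
        err = H @ w - y
        dH = np.outer(err, w) * (1 - H ** 2)          # chain rule through tanh
        W -= lr * (X.T @ dH) / n                      # gradient step on W
    H = np.tanh(X @ W)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)         # final convex refit
    return W, w

# Toy regression problem (hypothetical data).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
W, w = fit_one_hidden_layer(X, y)
print(np.mean((np.tanh(X @ W) @ w - y) ** 2))         # training mean squared error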

Bayesian optimization (global non-convex optimization). Typically used for finding optimal hyperparameters, e.g. determining the step size or the number of hidden layers of a neural network; the parameter values are bounded. Other methods include sampling the parameter values uniformly at random, and grid search.

Bayesian optimization (global non-convex optimization). Fit a Gaussian process to the observed data (purple shade in the figure): a probability distribution over function values.

Bayesian optimization (global non-convex optimization). Fit a Gaussian process to the observed data (purple shade in the figure), giving a probability distribution over function values. The acquisition function (green shade) is a function of the objective value under the Gaussian density (exploitation) and of the uncertainty in the prediction (exploration).
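
A minimal sketch of this loop, using a hand-rolled Gaussian process with an RBF kernel and an upper-confidence-bound acquisition function (the kernel, acquisition rule, and test objective are illustrative assumptions; the slides do not specify them):

import numpy as np

def rbf(A, B, length=0.15):
    # RBF kernel matrix between 1-D point sets A and B (unit signal variance).
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / length) ** 2)

def gp_posterior(X, y, Xq, noise=1e-6):
    # Gaussian process posterior mean and standard deviation at query points Xq.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def bayesian_optimization(f, iters=15, kappa=2.0, seed=0):
    # Maximize f on [0, 1]: fit a GP to the evaluations so far, then evaluate f
    # at the maximizer of the UCB acquisition mean + kappa * std
    # (exploitation via the mean, exploration via the uncertainty).
    rng = np.random.default_rng(seed)
    X = rng.random(3)                      # a few random initial evaluations
    y = np.array([f(x) for x in X])
    grid = np.linspace(0.0, 1.0, 200)      # candidate points
    for _ in range(iters):
        mean, std = gp_posterior(X, y, grid)
        x_next = grid[np.argmax(mean + kappa * std)]
        X, y = np.append(X, x_next), np.append(y, f(x_next))
    return X[np.argmax(y)], y.max()

# Hypothetical multimodal objective on [0, 1].
f = lambda x: np.sin(6 * x) + 0.5 * np.cos(11 * x)
print(bayesian_optimization(f))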

Bayesian optimization: slower than grid search when the objective has a low level of smoothness; faster than grid search when it has a high level of smoothness (illustrated on the following slides).

Bayesian optimization: slower than grid search with a low level of smoothness; faster with a high level of smoothness (figure comparing grid search and Bayesian optimization).

Bayesian optimization: slower than grid search with a low level of smoothness; faster with a high level of smoothness (figure: grid search versus Bayesian optimization as a function of a measure of smoothness).

Summary: non-convex optimization. Strategy 1: local non-convex optimization; convex convergence rates apply locally; escape saddle points using, for example, cubic regularization or the saddle-free Newton update. Strategy 2: relax the non-convex problem to a convex problem, e.g. convex neural networks. Strategy 3: global non-convex optimization, e.g. Bayesian optimization and matrix completion.