Non-Convex Optimization in Machine Learning. Jan Mrkos AIC


The Plan 1. Introduction 2. Non-convexity 3. (Some) optimization approaches 4. Speed and stuff?

Neural net universal approximation Theorem (1989): Let σ be bounded, monotonically increasing and continuous, and let C(I_m) be the space of continuous functions on the unit hypercube I_m ⊂ R^m. Then for any f ∈ C(I_m) and ε > 0 there exist N, v_i, w_i, b_i such that F(x) = Σ_{i=1}^{N} v_i σ(w_i^T x + b_i) satisfies |F(x) − f(x)| < ε for all x ∈ I_m. But training neural networks turns out to be NP-hard.*

* Reza Zadeh, https://www.oreilly.com/ideas/the-hard-thing-about-deep-learning
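To make the form in the theorem concrete, here is a minimal numpy sketch of the approximator F(x) = Σ_i v_i σ(w_i^T x + b_i). The weights below are arbitrary placeholders: the theorem only guarantees that suitable v_i, w_i, b_i exist, it says nothing about how to find them, and finding them is exactly the NP-hard training problem mentioned above.

    import numpy as np

    def sigma(z):
        # bounded, monotonically increasing, continuous activation (sigmoid)
        return 1.0 / (1.0 + np.exp(-z))

    def F(x, v, W, b):
        # F(x) = sum_i v_i * sigma(w_i^T x + b_i)  -- a one-hidden-layer network
        return np.sum(v * sigma(W @ x + b))

    m, N = 3, 10                      # input dimension, number of hidden units
    rng = np.random.default_rng(0)
    v, W, b = rng.normal(size=N), rng.normal(size=(N, m)), rng.normal(size=N)
    x = rng.uniform(size=m)           # a point in the unit cube
    print(F(x, v, W, b))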

ML and Optimization Unsupervised learning: clustering - minimizing the within-cluster sum of squares. Supervised learning: neural nets - minimizing training loss / generalization error.
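As a small illustration of the clustering objective mentioned above, the sketch below computes the within-cluster sum of squares for a made-up data set and assignment; minimizing this quantity over assignments and centers is the (non-convex) k-means problem.

    import numpy as np

    def within_cluster_ss(X, labels):
        # sum of squared distances of each point to its cluster mean
        total = 0.0
        for k in np.unique(labels):
            cluster = X[labels == k]
            total += np.sum((cluster - cluster.mean(axis=0)) ** 2)
        return total

    X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [2.9, 3.0]])  # toy data
    labels = np.array([0, 0, 1, 1])                                  # toy assignment
    print(within_cluster_ss(X, labels))   # small value for this good assignment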

The Plan 1. Introduction 2. Non-convexity 3. (Some) optimization approaches 4. Speed and stuff?

Part 2: Convex vs. Non-convex Source: https://www.facebook.com/nonconvex

Convex vs. Non-convex Convex: one global minimum. Non-convex: multiple local optima.

Convex vs. Non-convex How do you find the global optimum?

Non-convex Optimization Critical points: {x : ∇f(x) = 0}
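As a running toy example (not from the slides), take f(x, y) = x^4/4 − x^2/2 + y^2/2. Its gradient is ∇f(x, y) = (x^3 − x, y), which vanishes exactly at (0, 0), (1, 0) and (−1, 0), so those three points are its critical points; the code sketches further below reuse this function.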

Saddle points How many saddle points are there? Curse of dimensionality: for general problems, the number of critical points increases exponentially with the dimension* *Dauphin et al. - Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

Saddle points in ML problems Loss f is invariant to permutations: f(x_1, x_2, …, x_k) = f(x_2, x_1, …, x_k), with x_i ∈ R^n (k hidden neurons, each with n = 2 weights)

Saddle points in ML problems For a convex function, the average of optimal solutions is optimal. For a non-convex function, the average of optimal solutions can have 0 gradient but not be optimal -> a saddle point. Image from http://www.offconvex.org/2016/03/22/saddlepoints/
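With the toy function above this is easy to check: the two minima (1, 0) and (−1, 0) are related by the sign-flip symmetry f(x, y) = f(−x, y) (playing the role of the neuron-permutation invariance), and their average (0, 0) has zero gradient but Hessian diag(−1, 1), i.e. it is a saddle point, not a minimum.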

Saddle points in ML problems Some invariances in NNs: neuron permutations, layer scaling -> the number of critical points grows exponentially with the number of neurons! Universal approximation, but no free lunch. Overspecified models (such as those used with dropout)*: critical points of smaller networks create critical points in bigger nets; even a global minimum can generate saddle points. *Fukumizu, Amari: Local minima and plateaus in hierarchical structures of multilayer perceptrons

Saddle points in ML problems Overspecified models - EXAMPLE* [figure not transcribed] *Anandkumar: Nonconvex optimization: Challenges and Recent Successes - ICML 2016 tutorial

The Plan 1. Introduction 2. Non-convexity 3. (Some) optimization approaches 4. Speed and stuff?

Part 3: Learning with saddle points Saddle points lead to slow learning (long time spent in saddle manifold) How to deal with saddles?

Saddle point or local optimum? Function to optimize f(x), with critical points where ∇f(x) = 0. Second-order Taylor expansion: f(y) ≈ f(x) + ⟨∇f(x), y − x⟩ + ½ (y − x)^T ∇²f(x) (y − x) [figures: minimum vs. saddle]

Saddle point or local optimum? f(y) ≈ f(x) + ⟨∇f(x), y − x⟩ + ½ (y − x)^T ∇²f(x) (y − x). Local minimum: critical point with ∇²f(x) > 0 (positive definite). Non-degenerate saddle point: critical point where ∇²f(x) has both strictly positive and strictly negative eigenvalues. Indeterminate critical points: ∇²f(x) has zero eigenvalues. Anyway, why are critical points also called stationary points?
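The sketch below makes this classification concrete for the toy function f(x, y) = x^4/4 − x^2/2 + y^2/2 introduced earlier; the gradient and Hessian are hard-coded analytically, so this is an illustration rather than a general tool.

    import numpy as np

    def grad(p):
        x, y = p
        return np.array([x**3 - x, y])

    def hessian(p):
        x, y = p
        return np.array([[3 * x**2 - 1.0, 0.0],
                         [0.0,            1.0]])

    def classify(p, tol=1e-8):
        # classify a critical point by the eigenvalues of its Hessian
        assert np.linalg.norm(grad(p)) < tol, "not a critical point"
        eig = np.linalg.eigvalsh(hessian(p))
        if np.all(eig > tol):
            return "local minimum"
        if np.all(eig < -tol):
            return "local maximum"
        if np.any(eig > tol) and np.any(eig < -tol):
            return "non-degenerate saddle point"
        return "indeterminate (some eigenvalue is zero)"

    print(classify(np.array([1.0, 0.0])))   # local minimum
    print(classify(np.array([0.0, 0.0])))   # non-degenerate saddle point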

Gradient descent x_n = x_{n−1} − μ ∇f(x_{n−1})

Gradient descent Gets stuck at the saddle. But for a non-degenerate saddle there is a direction u such that u^T ∇²f(x) u < 0. Negative eigenvalue -> direction of escape. We need a second-order method.
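A quick numeric illustration with the toy function f(x, y) = x^4/4 − x^2/2 + y^2/2: if gradient descent is started on the y-axis (the stable manifold of the saddle at the origin), it converges to the saddle, even though the direction u = (1, 0) has negative curvature there. A sketch under those assumptions, not a general statement.

    import numpy as np

    def grad(p):
        x, y = p
        return np.array([x**3 - x, y])

    p = np.array([0.0, 1.0])     # start on the saddle's stable manifold (the y-axis)
    mu = 0.1                     # step size
    for _ in range(1000):
        p = p - mu * grad(p)     # plain gradient descent: x_n = x_{n-1} - mu * grad f(x_{n-1})

    print(p)                      # ends up at ~(0, 0), the saddle point
    u = np.array([1.0, 0.0])
    H = np.array([[3 * p[0]**2 - 1.0, 0.0], [0.0, 1.0]])
    print(u @ H @ u)              # negative: an escape direction exists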

Newton's method Well-known second-order method: x_n = x_{n−1} − H(x_{n−1})^{−1} ∇f(x_{n−1}), where H(x) = ∇²f(x). Faster convergence than gradient descent; great for convex problems. [figures: Newton's method vs. gradient descent] Images: left: Anandkumar: Nonconvex optimization: Challenges and Recent Successes - ICML 2016 tutorial; right: Wikipedia

Newton method - example Local quadratic approximation. Assume x is a non-degenerate saddle (the Hessian has negative eigenvalues) and let Δv_i denote the change along eigenvector v_i: f(x + Δx) ≈ f(x) + ½ Σ_i λ_i Δv_i^2. Gradient descent: very slow in the neighborhood of x due to the small gradient, but moves down along the directions with negative λ_i. Newton's method: multiplies each eigendirection by 1/λ_i. This gives faster convergence, but it also reverses the sign of the step along the downhill (negative-curvature) directions, i.e. it converges to the saddle point.
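To see this saddle attraction numerically, here is a sketch on the same toy function: a few exact Newton steps from a generic starting point land on the saddle at the origin, because the negative eigenvalue flips the sign of the step in the x-direction.

    import numpy as np

    def grad(p):
        x, y = p
        return np.array([x**3 - x, y])

    def hessian(p):
        x, y = p
        return np.array([[3 * x**2 - 1.0, 0.0],
                         [0.0,            1.0]])

    p = np.array([0.1, 1.0])          # a generic starting point near the saddle
    for _ in range(5):
        p = p - np.linalg.solve(hessian(p), grad(p))   # Newton step

    print(p)   # converges to ~(0, 0): the saddle, not a minimum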

Escaping saddle points Strategy: use gradient descent while the gradient is large. When the gradient becomes small, move along an eigenvector with λ_i < 0 to escape. Sophisticated alternative: cubic regularization of Newton's method.

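Below is a minimal sketch of that two-phase strategy on the toy function; the thresholds and step sizes are arbitrary choices for illustration, not values from any of the cited papers.

    import numpy as np

    def grad(p):
        x, y = p
        return np.array([x**3 - x, y])

    def hessian(p):
        x, y = p
        return np.array([[3 * x**2 - 1.0, 0.0],
                         [0.0,            1.0]])

    p = np.array([0.0, 1.0])       # starts on the saddle's stable manifold
    mu, tol = 0.1, 1e-3            # step size and "small gradient" threshold (arbitrary)
    for _ in range(200):
        g = grad(p)
        if np.linalg.norm(g) > tol:
            p = p - mu * g                         # phase 1: plain gradient descent
        else:
            lam, V = np.linalg.eigh(hessian(p))
            if lam[0] < 0:
                p = p + mu * V[:, 0]               # phase 2: step along a negative-curvature direction
            else:
                break                              # no negative curvature: near a local minimum

    print(p)   # ends near (1, 0) or (-1, 0), a local minimum, not the saddle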

The Plan 1. Introduction 2. Non-convexity 3. (Some) optimization approaches 4. Speed and stuff?

Part 4: How fast can we converge to a local minimum? Generally, hard to say

Class of strict saddle functions A function f(x) is strict saddle if every point x satisfies at least one of the following: 1. The gradient ∇f(x) is large. 2. The Hessian ∇²f(x) has a negative eigenvalue that is bounded away from 0. 3. The point x is near a local minimum. For strict saddle functions, cubic regularization finds a local minimum in polynomial time.
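Here is a small sketch that checks these three conditions pointwise for the toy function; the thresholds alpha, beta, delta and the list of known minima are made-up inputs for illustration, since the formal definition quantifies them uniformly over the whole domain.

    import numpy as np

    def grad(p):
        x, y = p
        return np.array([x**3 - x, y])

    def hessian(p):
        x, y = p
        return np.array([[3 * x**2 - 1.0, 0.0],
                         [0.0,            1.0]])

    minima = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]   # known local minima of the toy f

    def strict_saddle_conditions(p, alpha=0.1, beta=0.5, delta=0.3):
        big_gradient = np.linalg.norm(grad(p)) >= alpha                     # condition 1
        neg_curvature = np.linalg.eigvalsh(hessian(p))[0] <= -beta          # condition 2
        near_minimum = min(np.linalg.norm(p - m) for m in minima) <= delta  # condition 3
        return big_gradient, neg_curvature, near_minimum

    for point in [np.array([0.0, 0.0]), np.array([0.9, 0.1]), np.array([0.5, 2.0])]:
        print(point, strict_saddle_conditions(point))   # at least one condition holds at each point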

Simple is better Second-order algorithms are powerful, but computing second derivatives is expensive. Observation: non-degenerate saddle points are unstable. Noisy gradient descent will not get stuck! x_n = x_{n−1} − μ (∇f(x_{n−1}) + ε) No wonder stochastic gradient descent for NNs works!
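A minimal sketch of noisy gradient descent on the toy function: started exactly at the saddle (0, 0), plain gradient descent would stay there forever, while a small Gaussian perturbation of the gradient pushes the iterate off the unstable direction and it ends up near one of the minima (±1, 0). The noise scale is an arbitrary choice.

    import numpy as np

    def grad(p):
        x, y = p
        return np.array([x**3 - x, y])

    rng = np.random.default_rng(0)
    p = np.array([0.0, 0.0])      # start exactly at the saddle point
    mu, noise = 0.1, 0.01         # step size and noise scale (arbitrary)
    for _ in range(500):
        eps = noise * rng.normal(size=2)
        p = p - mu * (grad(p) + eps)   # noisy gradient descent

    print(p)   # close to (1, 0) or (-1, 0): the noise pushed it off the saddle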

Noisy gradient descent - speed Result 1 (Ge et al. 2015): Noisy gradient descent can find a local minimum of strict saddle functions in polynomial time. However, the polynomial dependency on the dimension and on the smallest eigenvalue of the Hessian is high and not practical. Gradient descent is still slow around saddle points.

Gradient descent not that bad? Result 2 (Lee et al. 2016): For strict saddle functions, the set of initial points from which gradient descent converges to a saddle has measure zero.

Other possible saddle points So in the general case we still know very little. Also, there are other (degenerate / higher-order) saddle points: Image: Rong Ge: http://www.offconvex.org/2016/03/22/saddlepoints/

Summary In ML non-convex problems, critical points are ubiquitous. Symmetries and overspecification make things even worse. Saddles make convergence slower. Escaping saddles: noisy gradient descent and second-order methods. Time complexity estimates exist for some problems; the general case with higher-order (>3) saddle points is still NP-hard.

Open questions How do we make optimization algorithms that work even when there are degenerate saddle points? Are there other guarantees for time complexity outside strict saddle functions? What are the saddle structures of popular problems, e.g. deep learning? Can noisy SGD escape higher-order saddle points in bounded time? Noisy SGD has worse scaling with dimension than second-order methods. Can this be improved?

Resources
ICML tutorial on non-convex optimization: http://newport.eecs.uci.edu/anandkumar/icml2016tutorial.html
Result 1: http://www.offconvex.org/2016/03/22/saddlepoints/ and https://arxiv.org/abs/1503.02101
Result 2: http://www.offconvex.org/2016/03/24/saddles-again/ and https://arxiv.org/abs/1602.04915
https://www.oreilly.com/ideas/the-hard-thing-about-deep-learning
Convexity and deep NNs: https://rinuboney.github.io/2015/10/18/theoretical-motivations-deep-learning.html
Nice paper: Nesterov, Polyak, 2006, Cubic regularization of Newton method and its global performance
Plots and animations made in matplotlib