A summary of Deep Learning without Poor Local Minima

A summary of "Deep Learning without Poor Local Minima" by Kenji Kawaguchi (MIT), presented as an oral at NIPS 2016.

Learning. Supervised (or predictive) learning: learn a mapping from inputs $x$ to outputs $y$, given a labeled set of input-output pairs (the training set) $D_n = \{(X_i, Y_i),\ i = 1, \dots, n\}$. For example, we learn the classification function $f = 1$ if versicolor, $f = -1$ if virginica (iris data).

Supervised learning. Training set: $D_n = \{(X_i, Y_i),\ i = 1, \dots, n\}$. - Input features: $X_i \in \mathbb{R}^d$. - Output: $Y_i \in \mathcal{Y}$, with $\mathcal{Y} = \mathbb{R}$ for regression (price, position, etc.) and $\mathcal{Y}$ finite for classification (type, mode, etc.). $y$ is a non-deterministic and complicated function of $x$, i.e., $y = f(x, z)$ where $z$ is unknown (e.g., noise). Goal: learn $f$. Learning algorithm: (diagram).

Objectives. Empirical risk: $\hat{R}(f) := \frac{1}{n} \sum_{i=1}^n \| f(X_i) - Y_i \|_2^2$. Look for a mapping in a class of functions $\mathcal{F}$ that minimizes the risk (or a regularized version of it): $\hat{f} \in \arg\min_{f \in \mathcal{F}} \hat{R}(f)$.
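
To make the objective concrete, here is a minimal sketch (a toy regression dataset and a fixed candidate mapping `f` are assumptions for illustration) that evaluates the empirical risk $\hat{R}$ defined above:

```python
import numpy as np

# Toy training set D_n = {(X_i, Y_i)}: X_i in R^d, Y_i in R (assumed for illustration)
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)  # noisy linear target

def f(x):
    """A hypothetical candidate mapping in the class F (here: a fixed linear map)."""
    return x @ np.array([0.9, -1.8, 0.4])

def empirical_risk(f, X, Y):
    """R_hat(f) = (1/n) * sum_i ||f(X_i) - Y_i||_2^2 (scalar outputs here)."""
    residuals = f(X) - Y
    return np.mean(residuals ** 2)

print(empirical_risk(f, X, Y))
```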

Neural networks. Loosely inspired by how the brain works (McCulloch-Pitts, 1943): construct a network of simplified neurons, with the hope of approximating and learning any possible function.

The perceptron. The first artificial neural network: one layer, with $\sigma(x) = \mathrm{sgn}(x)$ (classification). Input $x \in \mathbb{R}^d$, output in $\{-1, 1\}$. It can represent separating hyperplanes.
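
As a small illustration (the weights below are made up, not taken from the slides), a perceptron computes $\mathrm{sgn}(w \cdot x + b)$, so its decision boundary is the hyperplane $w \cdot x + b = 0$:

```python
import numpy as np

def perceptron(x, w, b):
    """Single-layer perceptron: sgn(w . x + b), output in {-1, +1}."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hypothetical weights: the decision boundary is the hyperplane w . x + b = 0, i.e. x1 = x2
w, b = np.array([1.0, -1.0]), 0.0
print(perceptron(np.array([2.0, 1.0]), w, b))   # +1 (x1 > x2 side)
print(perceptron(np.array([0.0, 3.0]), w, b))   # -1 (x1 < x2 side)
```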

Multilayer perceptrons. They can represent any function from $\mathbb{R}^d$ to $\{-1, 1\}$... but the structure depends on the unknown target function $f$, and is difficult to optimise.

From perceptrons to neural networks. ... and the number of layers can grow rapidly with the complexity of the function. A key idea to make neural networks practical: soft-thresholding...

Soft-thresholding. Replace the hard-thresholding function $\sigma$ by smoother functions. Theorem (Cybenko 1989): any continuous function $f$ from $[0, 1]^d$ to $\mathbb{R}$ can be approximated arbitrarily well by a function of the form $\sum_{j=1}^N \alpha_j \sigma(w_j^\top x + b_j)$, where $\sigma$ is any sigmoid function.
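
A minimal numerical sketch of a function of this form. The theorem does not say how to choose the parameters; drawing the $w_j, b_j$ at random and fitting the $\alpha_j$ by least squares is just one simple way to exhibit such an approximation (the target function, $N$, and the sampling scales are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target: a continuous function on [0, 1]
f = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0, 1, 200)

# Approximant: sum_j alpha_j * sigmoid(w_j * x + b_j) with N hidden units
N = 50
w = rng.normal(scale=10.0, size=N)    # random input weights (assumption)
b = rng.uniform(-10.0, 10.0, size=N)  # random biases (assumption)
Phi = sigmoid(np.outer(x, w) + b)     # (200, N) matrix of hidden-unit outputs

# Fit the output weights alpha by least squares
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)
print("max abs error:", np.max(np.abs(Phi @ alpha - f(x))))
```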

Soft-thresholding. Cybenko's theorem tells us that $f$ can be represented using a single-hidden-layer network... but the proof is non-constructive: how many neurons do we need? That might depend on $f$...

Neural networks. A feedforward layered network (deep learning = enough layers).

Deep learning and the ILSVRC challenge. Deep learning has outperformed all other techniques in the major machine learning competitions (image classification, speech recognition and natural language processing). The ImageNet Large Scale Visual Recognition Challenge (ILSVRC):
1. Training: 1.2 million images (227 × 227), each labeled with one out of 1000 categories.
2. Test: 100,000 images (227 × 227).
3. Error measure: the teams predict 5 classes (out of 1000), and an image is counted as correct if at least one of the predictions is the ground truth.

ILSVRC challenge. (Figure from Stanford CS231n lecture notes.)

Architectures (figure).

Architectures (figure).

Computing with neural networks.
Layer 0: inputs $x = (x^{(0)}_1, \dots, x^{(0)}_d)$, with the convention $x^{(0)}_0 = 1$.
Layers $1, \dots, L-1$: hidden layer $l$ has $d^{(l)} + 1$ nodes; $x^{(l)}_i$ is the state of node $i$, with $x^{(l)}_0 = 1$.
Layer $L$: output $y = x^{(L)}_1$.
Signal at node $k$ of layer $l$: $s^{(l)}_k = \sum_{i=0}^{d^{(l-1)}} w^{(l)}_{ik} x^{(l-1)}_i$.
State at node $k$: $x^{(l)}_k = \sigma(s^{(l)}_k)$.
Output: the state $y = x^{(L)}_1$.
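
A minimal forward-pass sketch following these equations (the layer sizes, the random weights, and the choice $\sigma = \tanh$ are illustrative assumptions):

```python
import numpy as np

def forward(x, weights, sigma=np.tanh):
    """Forward propagation.

    weights[l] has shape (d_{l-1} + 1, d_l): row 0 multiplies the bias unit x_0 = 1.
    Returns the signals s^(l) and states x^(l) of every layer.
    """
    states, signals = [np.concatenate(([1.0], x))], [None]  # layer 0 with bias unit
    for l, W in enumerate(weights, start=1):
        s = states[-1] @ W                      # s^(l)_k = sum_i w^(l)_{ik} x^(l-1)_i
        out = sigma(s)                          # x^(l)_k = sigma(s^(l)_k)
        if l < len(weights):                    # hidden layers keep a bias unit x^(l)_0 = 1
            out = np.concatenate(([1.0], out))
        signals.append(s)
        states.append(out)
    return signals, states

# Example: d = 2 inputs, one hidden layer with 3 nodes, scalar output y = x^(L)_1
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
_, states = forward(np.array([0.5, -1.0]), weights)
print("network output:", states[-1][0])
```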

Training neural networks. The output of the network is a function of $w = (w^{(l)}_{ij})_{i,j,l}$: $y = f_w(x)$. We wish to optimise over $w$ to find the most accurate estimate of the target function. Training data: $(X_1, Y_1), \dots, (X_n, Y_n) \in \mathbb{R}^d \times \{-1, 1\}$. Objective: find $w$ minimising the empirical risk $E(w) := \hat{R}(f_w) = \frac{1}{2n} \sum_{l=1}^n \| f_w(X_l) - Y_l \|^2$.

Stochastic gradient descent. $E(w) = \frac{1}{2n} \sum_{l=1}^n E_l(w)$, where $E_l(w) := \| f_w(X_l) - Y_l \|^2$. In each iteration of the SGD algorithm, only one function $E_l$ is considered.
Parameter: learning rate $\alpha > 0$.
1. Initialization: $w := w_0$.
2. Sample selection: select $l$ uniformly at random in $\{1, \dots, n\}$.
3. GD iteration: $w := w - \alpha \nabla E_l(w)$; go to 2.
Is there an efficient way of computing $\nabla E_l(w)$?
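
A minimal SGD loop following steps 1-3, on a problem where $\nabla E_l$ is easy to write by hand: a linear model $f_w(x) = w^\top x$ (the data, learning rate, and iteration count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
Y = X @ w_true + 0.01 * rng.normal(size=n)

def grad_E_l(w, l):
    """Gradient of E_l(w) = (w^T X_l - Y_l)^2 for the linear model f_w(x) = w^T x."""
    return 2.0 * (X[l] @ w - Y[l]) * X[l]

alpha = 0.01            # learning rate
w = np.zeros(d)         # 1. initialization w := w_0
for _ in range(20000):
    l = rng.integers(n)              # 2. pick l uniformly at random
    w = w - alpha * grad_E_l(w, l)   # 3. SGD step on E_l only
print("distance to w_true:", np.linalg.norm(w - w_true))  # small if SGD has converged
```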

Backpropagation. We fix $l$ and introduce $e(w) = E_l(w)$. Let us compute $\nabla e(w)$:
$$\frac{\partial e}{\partial w^{(l)}_{ij}} = \underbrace{\frac{\partial e}{\partial s^{(l)}_j}}_{:= \delta^{(l)}_j} \cdot \underbrace{\frac{\partial s^{(l)}_j}{\partial w^{(l)}_{ij}}}_{= x^{(l-1)}_i}.$$
The sensitivity of the error w.r.t. the signal at node $j$ can be computed recursively...

Backward recursion.
Output layer: $\delta^{(L)}_1 := \frac{\partial e}{\partial s^{(L)}_1}$; since $e(w) = (\sigma(s^{(L)}_1) - Y_l)^2$, we get $\delta^{(L)}_1 = 2\,(x^{(L)}_1 - Y_l)\,\sigma'(s^{(L)}_1)$.
From layer $l$ to layer $l-1$:
$$\delta^{(l-1)}_i := \frac{\partial e}{\partial s^{(l-1)}_i} = \sum_{j=1}^{d^{(l)}} \underbrace{\frac{\partial e}{\partial s^{(l)}_j}}_{:= \delta^{(l)}_j} \cdot \underbrace{\frac{\partial s^{(l)}_j}{\partial x^{(l-1)}_i}}_{= w^{(l)}_{ij}} \cdot \underbrace{\frac{\partial x^{(l-1)}_i}{\partial s^{(l-1)}_i}}_{= \sigma'(s^{(l-1)}_i)}.$$
Summary: $\frac{\partial E_l}{\partial w^{(l)}_{ij}} = \delta^{(l)}_j\, x^{(l-1)}_i$, and $\delta^{(l-1)}_i = \sum_{j=1}^{d^{(l)}} \delta^{(l)}_j\, w^{(l)}_{ij}\, \sigma'(s^{(l-1)}_i)$.

Backpropagation algorithm.
Parameter: learning rate $\alpha > 0$. Input: $(X_1, Y_1), \dots, (X_n, Y_n) \in \mathbb{R}^d \times \{-1, 1\}$.
1. Initialization: $w := w_0$.
2. Sample selection: select $l$ uniformly at random in $\{1, \dots, n\}$.
3. Gradient of $E_l$: set $x^{(0)}_i := X_{li}$ for all $i = 1, \dots, d$. Forward propagation: compute the state and signal $(x^{(l)}_i, s^{(l)}_i)$ at each node. Backward propagation: propagate $Y_l$ back to compute $\delta^{(l)}_i$ at each node and the partial derivatives $\frac{\partial E_l}{\partial w^{(l)}_{ij}}$.
4. GD iteration: $w := w - \alpha \nabla E_l(w)$; go to 2.
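
A self-contained sketch of the full algorithm for a network with one hidden layer (the architecture, $\sigma = \tanh$, the toy data, the learning rate, and the iteration count are all illustrative assumptions, not the presenter's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: X_l in R^2, Y_l in {-1, +1} (illustrative assumption)
X = rng.normal(size=(200, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

sigma = np.tanh
dsigma = lambda s: 1.0 - np.tanh(s) ** 2

# One hidden layer with 4 nodes; weight matrices include a bias row (node 0)
W1 = 0.1 * rng.normal(size=(3, 4))   # (d+1) x d_hidden
W2 = 0.1 * rng.normal(size=(5, 1))   # (d_hidden+1) x 1
alpha = 0.05                         # learning rate

def forward(x):
    x0 = np.concatenate(([1.0], x))              # x^(0) with bias unit
    s1 = x0 @ W1
    x1 = np.concatenate(([1.0], sigma(s1)))      # x^(1) with bias unit
    s2 = x1 @ W2
    return x0, s1, x1, s2, sigma(s2)             # last entry = network output

for _ in range(20000):
    l = rng.integers(len(X))                     # 2. sample selection
    x0, s1, x1, s2, out = forward(X[l])          # 3a. forward propagation
    delta2 = 2.0 * (out - Y[l]) * dsigma(s2)     # 3b. delta at the output node
    delta1 = (W2[1:, :] @ delta2) * dsigma(s1)   #     deltas at hidden nodes
    W2 -= alpha * np.outer(x1, delta2)           # 4. dE_l/dw_ij = delta_j * x_i
    W1 -= alpha * np.outer(x0, delta1)

acc = np.mean([np.sign(forward(x)[-1][0]) == y for x, y in zip(X, Y)])
print("training accuracy:", acc)   # expected to be high: the data is linearly separable
```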

Example: the TensorFlow playground, http://playground.tensorflow.org/

Deep Learning without Poor Local Minima. Critical question: the SGD algorithm will converge to a global minimum of the risk if we can guarantee that every local minimum has the same risk as a global minimum. So what does the loss surface look like? Related work:
- P. Baldi, K. Hornik. Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima. Neural Networks, 1989.
- I. Goodfellow, Y. Bengio, A. Courville. Deep Learning, http://www.deeplearningbook.org
- A. Choromanska et al. The Loss Surfaces of Multilayer Networks. AISTATS 2015.

Notation.
- Data: $X_i \in \mathbb{R}^{d_x}$, $Y_i \in \mathbb{R}^{d_y}$, $m$ data points.
- $X$: $d_x \times m$ matrix whose columns are the $X_i$'s; $Y$: $d_y \times m$ matrix whose columns are the $Y_i$'s.
- $H$ hidden layers; layer $k$ has $d_k$ neurons and input weight matrix $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$; $p = \min\{d_1, \dots, d_H\}$.
- Output: $\hat{Y}(W, X) = q\,\sigma_{H+1}\big(W_{H+1}\,\sigma(W_H\,\sigma(W_{H-1} \cdots \sigma(W_2\,\sigma(W_1 X))\cdots))\big)$.
- Linear activation function: $\hat{Y}(W, X) = W_{H+1} \cdots W_1 X$.

Baldi-Hornik: 1-hidden-layer linear networks. Linear regression: fitting a linear model to the data, $X_i \in \mathbb{R}^{d_x}$, $Y_i \in \mathbb{R}^{d_y}$. Find the matrix $L \in \mathbb{R}^{d_y \times d_x}$ minimizing $\mathcal{L}(L) = \sum_{i=1}^m \| Y_i - L X_i \|_2^2$. When $X X^\top$ is invertible, the minimizer is $L^* = Y X^\top (X X^\top)^{-1}$, by convexity of $\mathcal{L}$. Now, in a 1-hidden-layer linear network, we are looking for an $L$ that can be factorized as $W_2 W_1$, where $W_1 \in \mathbb{R}^{p \times d_x}$ and $W_2 \in \mathbb{R}^{d_y \times p}$. In particular, the rank of $L$ is at most $p$. Non-uniqueness: $W_1' = C W_1$ and $W_2' = W_2 C^{-1}$ work as well (for any invertible $C$).
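
A quick numerical check of the unconstrained case, on synthetic data (the sizes are assumptions): the closed form $L^* = Y X^\top (X X^\top)^{-1}$ should match a generic least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, m = 4, 3, 50
X = rng.normal(size=(d_x, m))             # columns are the X_i's
Y = rng.normal(size=(d_y, d_x)) @ X + 0.1 * rng.normal(size=(d_y, m))

# Closed-form minimizer of L(L) = sum_i ||Y_i - L X_i||^2
L_star = Y @ X.T @ np.linalg.inv(X @ X.T)

# Compare with numpy's least-squares solver (solves X^T L^T ~ Y^T column-wise)
L_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
print(np.allclose(L_star, L_lstsq))       # expected: True
```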

Baldi-Hornik: 1-hidden-layer linear networks. Define the $d_y \times d_y$ matrix $\Sigma = Y X^\top (X X^\top)^{-1} X Y^\top$ (covariance matrix of the best unconstrained linear approximation of $Y$). Assume $d_x \geq d_y$.
Theorem (Baldi-Hornik 1989). Assume that $\Sigma$ is full rank and has $d_y$ distinct eigenvalues $\lambda_1 > \dots > \lambda_{d_y}$. Let $W_1$ and $W_2$ define a critical point of $\mathcal{L}(W_1, W_2)$. Then there exist a subset $\Gamma$ of $p$ (orthonormal) eigenvectors of $\Sigma$ and a $p \times p$ invertible matrix $C$ such that $W_2 = U_\Gamma C$ and $W_1 = C^{-1} U_\Gamma^\top Y X^\top (X X^\top)^{-1}$, where $U_\Gamma$ is the matrix formed by the eigenvectors in $\Gamma$. Moreover, $\mathcal{L}(W_1, W_2) = \mathrm{trace}(Y Y^\top) - \sum_{i \in \Gamma} \lambda_i$.

Baldi-Hornik: 1-hidden-layer linear networks (interpretation of the theorem). Up to the choice of $C$, the global minimizer is unique: it is the projection of the ordinary least-squares regression matrix onto the subspace spanned by the $p$ top eigenvectors of $\Sigma$! Taking any other set of eigenvectors for the projection yields a saddle point.
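
A numerical sanity check of the theorem on synthetic data (sizes are assumptions): the critical point built from the top-$p$ eigenvectors of $\Sigma$ (taking $C = I$) should achieve loss $\mathrm{trace}(Y Y^\top) - \sum_{i \in \Gamma} \lambda_i$, and a different choice of $p$ eigenvectors should give a strictly larger loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, p, m = 6, 4, 2, 100
X = rng.normal(size=(d_x, m))
Y = rng.normal(size=(d_y, d_x)) @ X + 0.1 * rng.normal(size=(d_y, m))

L_ols = Y @ X.T @ np.linalg.inv(X @ X.T)          # unconstrained least-squares matrix
Sigma = L_ols @ X @ Y.T                           # Sigma = Y X^T (XX^T)^-1 X Y^T
lam, U = np.linalg.eigh(Sigma)                    # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]                    # sort descending

def loss(W2, W1):
    return np.linalg.norm(Y - W2 @ W1 @ X, 'fro') ** 2

# Critical point built from the top-p eigenvectors (C = identity): the global minimum
U_top = U[:, :p]
print(loss(U_top, U_top.T @ L_ols),
      np.trace(Y @ Y.T) - lam[:p].sum())          # the two values should coincide

# Critical point built from a different set of p eigenvectors: higher loss (a saddle)
U_other = U[:, 1:p + 1]
print(loss(U_other, U_other.T @ L_ols))           # strictly larger
```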

This paper: linear networks of any depth and width.
Theorem (Kawaguchi 2016). Assume that $X X^\top$ and $X Y^\top$ are full rank, $d_x \geq d_y$, and that $\Sigma$ is full rank with $d_y$ distinct eigenvalues. The loss function $\mathcal{L}(W_1, \dots, W_{H+1})$ satisfies:
(i) it is non-convex and non-concave;
(ii) every local minimum is a global minimum;
(iii) every critical point that is not a global minimum is a saddle point;
(iv) if $\mathrm{rank}(W_H \cdots W_2) = p$, then the Hessian at any saddle point has at least one strictly negative eigenvalue.
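
A small empirical illustration of (ii), offered as an experiment rather than a proof: on synthetic data, plain gradient descent (with a simple backtracking step size) on a deep linear network tends to reach the loss of the analytically known rank-$p$ global minimum. The network sizes, initialization, and iteration budget are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, m = 6, 4, 50
dims = [d_x, 5, 3, 5, d_y]                 # 3 hidden layers, so p = min(5, 3, 5) = 3
X = rng.normal(size=(d_x, m))
Y = rng.normal(size=(d_y, d_x)) @ X + 0.1 * rng.normal(size=(d_y, m))

# Analytic global minimum of the rank-p-constrained problem (Baldi-Hornik / Kawaguchi)
L_ols = Y @ X.T @ np.linalg.inv(X @ X.T)
lam = np.sort(np.linalg.eigvalsh(L_ols @ X @ Y.T))[::-1]
p = min(dims[1:-1])
global_min = np.trace(Y @ Y.T) - lam[:p].sum()

def chain(mats, size):
    """Product W_k ... W_1 of a list of weight matrices (identity if the list is empty)."""
    out = np.eye(size)
    for M in mats:
        out = M @ out
    return out

def loss(Ws):
    return np.linalg.norm(chain(Ws, d_x) @ X - Y, 'fro') ** 2

def grads(Ws):
    R = chain(Ws, d_x) @ X - Y
    g = []
    for k in range(len(Ws)):
        A = chain(Ws[k + 1:], dims[k + 1])     # layers above W_k
        B = chain(Ws[:k], d_x)                 # layers below W_k
        g.append(2.0 * A.T @ R @ X.T @ B.T)
    return g

# Gradient descent with backtracking line search, from a random initialization
Ws = [rng.normal(size=(dims[k + 1], dims[k])) / np.sqrt(dims[k]) for k in range(len(dims) - 1)]
for _ in range(20000):
    g = grads(Ws)
    step, cur = 0.1, loss(Ws)
    while loss([W - step * gk for W, gk in zip(Ws, g)]) > cur:
        step /= 2.0                            # backtrack until the loss decreases
    Ws = [W - step * gk for W, gk in zip(Ws, g)]

print("loss reached by gradient descent:", loss(Ws))
print("analytic global minimum:         ", global_min)  # close if GD has converged
```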

Proof sketch (example case). Assume that $W$ is a critical point and a local minimum, and that $\mathrm{rank}(W_H \cdots W_2) = p$. Necessary conditions: $\nabla \mathcal{L} = 0$ and $\nabla^2 \mathcal{L}$ positive semidefinite. From these conditions, we deduce that $X(\hat{Y}(W, X) - Y)^\top = 0$. Go back to the unconstrained linear case: $f(W') = \| W' X - Y \|_F^2$ for $W' \in \mathbb{R}^{d_y \times d_x}$, and let $r' = (W' X - Y)^\top$ denote the error matrix. By convexity, if $X r' = 0$ then $W'$ is a global minimizer of $f$. Now, with $W' = W_{H+1} \cdots W_1$, we have $X r' = X(\hat{Y}(W, X) - Y)^\top = 0$, and hence $W'$ is a global minimizer of $f$.
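
To fill in the convexity step explicitly, here is the short computation behind "if $X r' = 0$ then $W'$ is a global minimizer of $f$" (only the definitions above are used):

```latex
% f(W') = \|W'X - Y\|_F^2 is convex in W', so stationarity is equivalent to global optimality.
\nabla_{W'} f(W')
  = \nabla_{W'}\,\operatorname{trace}\!\big((W'X - Y)(W'X - Y)^\top\big)
  = 2\,(W'X - Y)\,X^\top
  = 2\,\big(X\,(W'X - Y)^\top\big)^\top
  = 2\,(X r')^\top .
% Hence X r' = 0  \iff  \nabla_{W'} f(W') = 0  \iff  W' is a global minimizer of f.
```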

This paper: non-linear networks of any depth and width. Rectified linear activation function: $\sigma(x) = \max(0, x)$. Output: $\hat{Y}(W, X) = q\,\sigma_{H+1}\big(W_{H+1}\,\sigma(W_H\,\sigma(W_{H-1} \cdots \sigma(W_2\,\sigma(W_1 X))\cdots))\big)$. Another way of writing the output: $\hat{Y}(W, X) = q \sum_{i=1}^{d_x} \sum_{j=1}^{\gamma} X_{i,j}\, A_{i,j} \prod_{k=1}^{H+1} W^{(k)}_{i,j}$, where the first sum is over the input coordinates, the second sum is over the $\gamma$ paths from the $i$-th input to the output, $X_{i,j} = X_i$ for all $j$ is the $i$-th input, $W^{(k)}_{i,j}$ is the weight used by path $j$ in layer $k$, and $A_{i,j}$ is the activation (a binary variable) of path $j$ for input $i$.
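
A tiny verification of this path-sum rewriting for a single hidden ReLU layer and a scalar output (the network and $q = 1$ are assumptions; note that here there is one path per hidden unit, and the true $A_{i,j}$ is determined by the data, unlike the Bernoulli simplification on the next slide):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 5                        # one hidden ReLU layer, single output (q = 1 assumed)
W1 = rng.normal(size=(d_h, d_x))
W2 = rng.normal(size=(1, d_h))
x = rng.normal(size=d_x)

relu = lambda z: np.maximum(0.0, z)

# Direct forward pass
y_forward = (W2 @ relu(W1 @ x))[0]

# Path-sum form: sum over inputs i and paths j (here one path per hidden unit);
# A_j = 1 if hidden unit j is active, times the product of the weights along the path
A = (W1 @ x > 0).astype(float)         # path activations (same for every input i)
y_paths = sum(x[i] * A[j] * W1[j, i] * W2[0, j]
              for i in range(d_x) for j in range(d_h))

print(np.isclose(y_forward, y_paths))  # expected: True
```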

This paper: non-linear networks of any depth and width. Critical simplification: the $A_{i,j}$'s are assumed to be independent Bernoulli random variables with mean $\rho$! Under this assumption (among others), there is an equivalence with a linear network, and the previous theorem holds...

Summary. SGD could well find a global minimizer of the empirical risk, under some conditions... Open questions: what is the impact of regularization? What about other activation functions?