Failures of Gradient-Based Deep Learning

Shai Shalev-Shwartz, Shaked Shammah, Ohad Shamir
The Hebrew University and Mobileye
Representation Learning Workshop, Simons Institute, Berkeley, 2017

Deep Learning is Amazing...
[Several example slides; images not transcribed]
INTEL IS BUYING MOBILEYE FOR $15.3 BILLION IN BIGGEST ISRAELI HI-TECH DEAL EVER

This Talk
Simple problems where standard deep learning either:
- Does not work well:
  - Requires prior knowledge for better architectural/algorithmic choices
  - Requires an update rule other than the gradient
  - Requires decomposing the problem and adding more supervision
- Or does not work at all:
  - No local-search algorithm can work
  - Even for nice distributions and well-specified models
  - Even with over-parameterization (a.k.a. improper learning)
Mix of theory and experiments.

Outline
1. Piecewise-Linear Curves
2. Flat Activations
3. End-to-End Training
4. Learning Many Orthogonal Functions

Piecewise-Linear Curves: Motivation

Piecewise-Linear Curves
Problem: train a piecewise-linear curve detector.
- Input: f = (f(0), f(1), ..., f(n-1)), where f(x) = \sum_{r=1}^{k} a_r [x - \theta_r]_+ and \theta_r \in \{0, ..., n-1\}
- Output: curve parameters \{a_r, \theta_r\}_{r=1}^{k}
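
To make the setup concrete, here is a minimal NumPy sketch that generates such a curve; the choices of n, k, the seed, and the sampling of a_r and \theta_r are illustrative, not taken from the talk.

```python
import numpy as np

def piecewise_linear_curve(n, a, theta):
    """f(x) = sum_r a_r * [x - theta_r]_+ evaluated at x = 0, ..., n-1."""
    x = np.arange(n)[:, None]                       # shape (n, 1)
    return np.maximum(x - theta[None, :], 0.0) @ a  # shape (n,)

rng = np.random.default_rng(0)
n, k = 100, 3
theta = rng.choice(n - 2, size=k, replace=False) + 1   # interior kink positions in {1, ..., n-2}
a = rng.normal(size=k)                                 # kink amplitudes
f = piecewise_linear_curve(n, a, theta)                # the input given to the learner
```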

First Try: Deep Autoencoder
- Encoding network E_{w_1}: Dense(500, relu) - Dense(100, relu) - Dense(2k)
- Decoding network D_{w_2}: Dense(100, relu) - Dense(100, relu) - Dense(n)
- Squared loss: \|D_{w_2}(E_{w_1}(f)) - f\|^2

Doesn't work well...
[Figure: fit after 500, 10,000, and 50,000 iterations]

Second Try: Pose as a Convex Objective
Problem: train a piecewise-linear curve detector.
- Input: f \in R^n where f(x) = \sum_{r=1}^{k} a_r [x - \theta_r]_+
- Output: curve parameters \{a_r, \theta_r\}_{r=1}^{k}
Convex formulation:
- Let p \in R^n be a k-sparse vector whose k max/argmax elements are \{a_r, \theta_r\}_{r=1}^{k}
- Observe: f = W p, where W \in R^{n,n} is such that W_{i,j} = [i - j + 1]_+
- Learning approach (linear regression): train a one-layer fully connected network on (f, p) examples:
  min_U E[ \|U f - p\|^2 ] = E[ \|U f - W^{-1} f\|^2 ]
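
Continuing the sketch above, a quick check of the convex formulation: with W as on the slide, p = W^{-1} f is indeed k-sparse, with value a_r at a position determined by \theta_r (offset by one under this particular 0-indexed discretization).

```python
# W[i, j] = [i - j + 1]_+ as on the slide; p = W^{-1} f is k-sparse.
i, j = np.arange(n)[:, None], np.arange(n)[None, :]
W = np.maximum(i - j + 1, 0).astype(float)

p = np.linalg.solve(W, f)                       # p = W^{-1} f
support = np.flatnonzero(np.abs(p) > 1e-6)
print(support, p[support])                      # spikes of height a_r at theta_r + 1
```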

Convex and realizable, but still doesn't work well...
[Figure: fit after 500, 10,000, and 50,000 iterations]

What Went Wrong?
Theorem: the convergence of SGD is governed by the condition number of W^\top W, which is large:
  \lambda_{\max}(W^\top W) / \lambda_{\min}(W^\top W) = \Omega(n^{3.5})
SGD requires \Omega(n^{3.5}) iterations to reach U such that \|E[U] - W^{-1}\| < 1/2.
Note: Adagrad/Adam don't help, because they perform only diagonal conditioning.
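
A quick numerical look at the quantity the theorem refers to (continuing the sketch above); the printed values grow rapidly with n, consistent with the \Omega(n^{3.5}) lower bound.

```python
# Condition number of W^T W for W[i, j] = [i - j + 1]_+ at a few problem sizes.
for n_ in (50, 100, 200, 400):
    i_, j_ = np.arange(n_)[:, None], np.arange(n_)[None, :]
    Wn = np.maximum(i_ - j_ + 1, 0).astype(float)
    print(n_, np.linalg.cond(Wn.T @ Wn))
```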

Third Try: Pose as a Convolution
p = W^{-1} f
Observation:
  W^{-1} =
    [  1   0   0   0  ...
      -2   1   0   0  ...
       1  -2   1   0  ...
       0   1  -2   1  ...
       0   0   1  -2  ... ]
so W^{-1} f is a 1D convolution of f with the second-derivative filter (1, -2, 1).
Can train a one-layer convnet to learn the filter (a problem in R^3!)
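
Continuing the sketch, the convolutional view can be checked directly: away from the boundary, convolving f with the filter (1, -2, 1) reproduces p.

```python
# Interior rows of W^{-1} are second differences, i.e. a convolution with (1, -2, 1).
conv = np.convolve(f, [1.0, -2.0, 1.0])     # conv[i] = f[i-2] - 2 f[i-1] + f[i]
assert np.allclose(conv[2:n], p[2:n])       # matches W^{-1} f except the two boundary rows
```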

Better, but still doesn't work well...
[Figure: fit after 500, 10,000, and 50,000 iterations]
Theorem: the condition number is reduced to \Theta(n^3). Convolutions aid geometry!
But \Theta(n^3) is very disappointing for a problem in R^3...

Fourth Try: Convolution + Preconditioning
The convnet is equivalent to solving min_x E[ \|F x - p\|^2 ], where F is an n x 3 matrix with row i equal to (f(i-1), f(i), f(i+1)).
Observation: the problem is now low-dimensional, so we can easily precondition:
- Compute an empirical approximation C of E[F^\top F]
- Solve min_x E[ \|F C^{-1/2} x - p\|^2 ]
- Return C^{-1/2} x
The condition number of F C^{-1/2} is close to 1.
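
A hedged sketch of this recipe, continuing the code above: estimate the 3x3 matrix C from sample curves, whiten, run plain SGD on the whitened 3-parameter problem, and map the solution back. The sample sizes, learning rate, and the restriction of targets to interior positions are my own simplifications.

```python
def conv_rows(f):
    """(n-2) x 3 design matrix whose rows are (f(i-1), f(i), f(i+1)) for interior i."""
    return np.stack([f[:-2], f[1:-1], f[2:]], axis=1)

def sample_curve(rng):
    theta = rng.choice(n - 2, size=k, replace=False) + 1
    a = rng.normal(size=k)
    f = piecewise_linear_curve(n, a, theta)
    p = np.zeros(n); p[theta] = a               # spike a_r at position theta_r
    return f, p[1:-1]                           # interior targets only

samples = [sample_curve(rng) for _ in range(500)]

# Empirical approximation C of E[F^T F] and its inverse square root.
C = np.mean([conv_rows(f).T @ conv_rows(f) for f, _ in samples], axis=0)
evals, evecs = np.linalg.eigh(C)
C_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

x, lr = np.zeros(3), 0.1
for f_s, p_s in samples * 20:                   # plain SGD on the preconditioned problem
    F = conv_rows(f_s) @ C_inv_sqrt
    x -= lr * 2 * F.T @ (F @ x - p_s) / len(p_s)
print(C_inv_sqrt @ x)                           # should approach the filter (1, -2, 1)
```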

Finally, it works...
[Figure: fit after 500, 10,000, and 50,000 iterations]
Remark: the use of a convnet allows for efficient preconditioning: we estimate and manipulate 3 x 3 rather than n x n matrices.

Lesson Learned
SGD might be extremely slow.
Prior knowledge allows us to:
- Choose a better architecture (not for expressivity, but for better geometry)
- Choose a better algorithm (preconditioned SGD)

Outline: 1. Piecewise-Linear Curves; 2. Flat Activations; 3. End-to-End Training; 4. Learning Many Orthogonal Functions

Flat Activations
Vanishing gradients due to saturating activations (e.g. in RNNs)

Flat Activations
Problem: learning x \mapsto u(\langle w^*, x \rangle), where u is a fixed step function.
Optimization problem: min_w E_x[ (u(N_w(x)) - u(\langle w^*, x \rangle))^2 ]
u'(z) = 0 almost everywhere, so gradient-based methods cannot be applied.

Flat Activations: Smooth Approximation
Smooth approximation: replace u with \tilde{u} (similar to using sigmoids as gates in LSTMs).
[Figure: u vs. \tilde{u}]
Sometimes works, but slow and gives only an approximate result. Often fails completely.
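
As a rough illustration of why the smoothed objective can still be hard (my own toy staircase u and jump locations, not the talk's): replacing each jump with a steep sigmoid makes the derivative nonzero, but it decays very quickly away from the jumps, so gradients remain tiny over most of the input range.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
# Smoothed staircase: one steep sigmoid per jump (jumps at -1, 0, 1, 2 here), shifted to start at -2.
u_tilde = lambda z, T=20.0: sum(sigmoid(T * (z - b)) for b in (-1.0, 0.0, 1.0, 2.0)) - 2.0

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(u_tilde(z))                                    # roughly (-2, -1, 0, 2): staircase values
print((u_tilde(z + 1e-4) - u_tilde(z)) / 1e-4)       # numerical derivative: tiny between the jumps
```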

Flat Activations: Perhaps I Should Use a Deeper Network...
Approach: end-to-end, min_w E_x[ (N_w(x) - u(\langle w^*, x \rangle))^2 ]
(3 ReLU layers + 1 linear layer; 10,000 iterations)
Slow train and test time; the target curve is not captured well.

Flat Activations: Multiclass
Approach: multiclass classification, N_w(x) predicts to which step x belongs:
  min_w E_x[ \ell(N_w(x), y(x)) ]
Problem capturing the boundaries.

Flat Activations: Forward-Only Backpropagation
Different approach (Kalai & Sastry 2009; Kakade, Kalai, Kanade, Shamir 2011): gradient descent, but replace the gradient with something else.
- Objective: min_w E_x[ \frac{1}{2} (u(\langle w, x \rangle) - u(\langle w^*, x \rangle))^2 ]
- Gradient: \nabla = E_x[ (u(\langle w, x \rangle) - u(\langle w^*, x \rangle)) u'(\langle w, x \rangle) x ]
- Non-gradient direction: \Delta = E_x[ (u(\langle w, x \rangle) - u(\langle w^*, x \rangle)) x ]
Interpretation: forward-only backpropagation.
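
A minimal sketch of this update rule in the spirit of the cited works (my own choice of step function u, dimensions, step size, and Gaussian inputs): the residual is multiplied by x directly, skipping the u'(\langle w, x \rangle) factor that is zero almost everywhere.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 2000
u = lambda z: np.floor(np.clip(z, -2.0, 2.0))    # a fixed monotone step ("staircase") function
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(m, d))
y = u(X @ w_star)

w = np.zeros(d)
for _ in range(500):
    delta = (u(X @ w) - y) @ X / m               # non-gradient direction: residual times x
    w -= 0.5 * delta
print(np.mean((u(X @ w) - y) ** 2))              # small; plain GD cannot move at all (u' = 0 a.e.)
```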

Flat Activations (linear; 5,000 iterations)
Best results, and smallest train+test time.
Analysis (KS09, KKKS11): needs O(L^2 / \epsilon^2) iterations if u is L-Lipschitz.
Lesson learned: local search works, but not with the gradient...

Outline: 1. Piecewise-Linear Curves; 2. Flat Activations; 3. End-to-End Training; 4. Learning Many Orthogonal Functions

End-to-End vs. Decomposition
- Input x: a k-tuple of images of random lines
- f_1(x): for each image, whether the slope is negative or positive
- f_2(x): the parity of the slope signs
- Goal: learn f_2(f_1(x))

End-to-End vs. Decomposition
Architecture: concatenation of LeNet and a 2-layer ReLU network, linked by a sigmoid.
- End-to-end approach: train the overall network on the primary objective.
- Decomposition approach: augment the objective with a loss specific to the first net, using per-image labels.
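
A hedged PyTorch sketch of the two objectives on random stand-in data (the small MLP below plays the role of LeNet, and all shapes, labels, and hyperparameters are placeholders): the end-to-end variant uses only the primary parity loss, while the decomposition variant adds a per-image loss on the first network's outputs.

```python
import torch
import torch.nn as nn

k, batch = 3, 32
images = torch.randn(batch, k, 1, 28, 28)                  # k images per example (random stand-ins)
slope_sign = torch.randint(0, 2, (batch, k)).float()       # per-image labels (used only by decomposition)
parity = slope_sign.sum(dim=1).remainder(2)                # primary label f2(f1(x))

first_net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 1))
second_net = nn.Sequential(nn.Linear(k, 10), nn.ReLU(), nn.Linear(10, 1))
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(list(first_net.parameters()) + list(second_net.parameters()), lr=0.1)

per_image_logits = first_net(images.view(batch * k, 1, 28, 28)).view(batch, k)
primary_logits = second_net(torch.sigmoid(per_image_logits)).squeeze(1)  # sigmoid links the two nets

loss = bce(primary_logits, parity)                          # end-to-end: primary objective only
loss = loss + bce(per_image_logits, slope_sign)             # decomposition: add per-image supervision
opt.zero_grad(); loss.backward(); opt.step()
```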

[Figure: end-to-end vs. decomposition performance for k = 1, 2, 3; 20,000 iterations]

End-to-End vs. Decomposition
Why doesn't end-to-end training work?
- A similar experiment appears in Gulcehre and Bengio, 2016.
- They suggest local-minima problems.
- We show that the problem is different.

Signal-to-noise ratio (SNR) at random initialization: end-to-end (red) vs. decomposition (blue), as a function of k.
The SNR of end-to-end training for k >= 3 is below the precision of float32.
[Figure: SNR vs. k, log scale]

Outline: 1. Piecewise-Linear Curves; 2. Flat Activations; 3. End-to-End Training; 4. Learning Many Orthogonal Functions

Learning Many Orthogonal Functions
Let H be a hypothesis class of orthonormal functions: for all h \ne h' \in H, E_x[h(x) h'(x)] = 0.

(Improper) learning of H using gradient-based deep learning:
- Learn the parameter vector w of some architecture p_w : X \to R
- For every target h \in H, the learning task is to solve min_w F_h(w) := E_x[ \ell(p_w(x), h(x)) ]
- Start with a random w and update the weights based on \nabla F_h(w)

Analysis tool: how much does \nabla F_h(w) tell us about the identity of h?

Analysis: How Much Information Is in the Gradient?
min_w F_h(w) := E_x[ \ell(p_w(x), h(x)) ]
Theorem: for every w, there are many pairs h \ne h' \in H with E_x[h(x) h'(x)] = 0, yet
  \| \nabla F_h(w) - \nabla F_{h'}(w) \|^2 = O(1 / |H|)

Proof idea: show that if the functions in H are orthonormal then, for every w,
  Var(H, F, w) := E_h \| \nabla F_h(w) - E_{h'} \nabla F_{h'}(w) \|^2 = O(1 / |H|)
To do so, express every coordinate of \nabla p_w(x) using the orthonormal functions in H.

Proof Idea
Assume the squared loss; then
  \nabla F_h(w) = E_x[ (p_w(x) - h(x)) \nabla p_w(x) ] = \underbrace{E_x[ p_w(x) \nabla p_w(x) ]}_{\text{independent of } h} - E_x[ h(x) \nabla p_w(x) ]

Fix some j and denote g(x) = \nabla_j p_w(x).
We can expand g = \sum_{i=1}^{|H|} \langle h_i, g \rangle h_i + (orthogonal component).
Therefore E_h (E_x[h(x) g(x)])^2 \le E_x[g(x)^2] / |H|.
It follows that Var(H, F, w) \le E_x[ \| \nabla p_w(x) \|^2 ] / |H|.

Example: Parity Functions
H is the class of parity functions over \{0,1\}^d and x is uniformly distributed.
There are 2^d orthonormal functions, hence there are many pairs h \ne h' \in H with E_x[h(x) h'(x)] = 0 while
  \| \nabla F_h(w) - \nabla F_{h'}(w) \|^2 = O(2^{-d})
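
A small numerical check of this bound (with a simple stand-in predictor p_w(x) = tanh(\langle w, x \rangle) and the loss \frac{1}{2}(p_w(x) - h(x))^2, which are my choices, not the talk's): the variance of the gradient across all 2^d parities stays below E_x[\|\nabla p_w(x)\|^2] / |H|.

```python
import numpy as np
from itertools import product

d = 10
X = np.array(list(product([0.0, 1.0], repeat=d)))       # all 2^d inputs (uniform distribution)
rng = np.random.default_rng(0)
w = rng.normal(size=d)

pred = np.tanh(X @ w)                                    # p_w(x)
dpred = (1.0 - pred ** 2)[:, None] * X                   # grad_w p_w(x), one row per x

H_vals = (-1.0) ** (X @ X.T)                             # column S: parity over subset S (rows of X double as subset masks)
grads = dpred.T @ (pred[:, None] - H_vals) / len(X)      # column S: grad of F_{h_S}(w)

var = np.mean(np.sum((grads - grads.mean(axis=1, keepdims=True)) ** 2, axis=0))
bound = np.mean(np.sum(dpred ** 2, axis=1)) / len(X)     # E_x[||grad p_w(x)||^2] / |H|
print(var, bound)                                        # var does not exceed the bound
```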

Remark: a similar hardness result can be shown by combining existing results:
- Learning parities under the uniform distribution over \{0,1\}^d is difficult for statistical-query algorithms (Kearns, 1999)
- Gradient descent with approximate gradients can be implemented with statistical queries (Feldman, Guzman, Vempala, 2015)

Visual Illustration: Linear-Periodic Functions
F_h(w) = E_x[ (\cos(\langle w, x \rangle) - h(x))^2 ] for h(x) = \cos(\langle [2, 2], x \rangle), in 2 dimensions, x \sim N(0, I):
- No local minima or saddle points
- However, the surface is extremely flat unless w is very close to the optimum: difficult for gradient methods, especially stochastic ones
- In fact, difficult for any local-search method
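
A short Monte-Carlo sketch of this surface (grid range and sample size are arbitrary choices): the loss is essentially constant except in a small neighborhood of w = ±(2, 2).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 2))                 # x ~ N(0, I) samples
target = np.cos(X @ np.array([2.0, 2.0]))

grid = np.linspace(-4.0, 4.0, 41)
F = np.array([[np.mean((np.cos(X @ np.array([a, b])) - target) ** 2) for b in grid] for a in grid])
print(F.min(), np.median(F))                     # a deep dip near (2, 2); nearly flat (about 1) elsewhere
```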

Summary
- Cause of failures: optimization can be difficult for geometric reasons other than local minima / saddle points
  - Condition number, flatness
  - Using bigger/deeper networks doesn't always help
- Remedies: prior knowledge can still be important
  - Convolution can improve geometry (and not just sample complexity)
  - Update rules other than the gradient
  - Decomposing the problem and adding supervision can improve geometry
- Understanding the limitations: while deep learning is great, understanding its limitations may lead to better algorithms and/or better theoretical guarantees
For more information: "Failures of Deep Learning", arXiv:1703.07950; github.com/shakedshammah/failures_of_dl