Implicit Optimization Bias

Transcription:

Implicit Optimization Bias as a Key to Understanding Deep Learning
Nati Srebro (TTIC)
Based on joint work with Behnam Neyshabur (TTIC → IAS), Ryota Tomioka (TTIC → MSR), Srinadh Bhojanapalli, Suriya Gunasekar, Blake Woodworth, Pedro Savarese (TTIC), Russ Salakhutdinov (CMU), Ashia Wilson, Becca Roelofs, Mitchell Stern, Ben Recht (Berkeley), Daniel Soudry, Elad Hoffer, Mor Shpigel (Technion), Jason Lee (USC)

Increasing the Network Size [Neyshabur Tomioka S ICLR 15]

[Figure: test error as the network size increases, plotted against a complexity measure (path norm?)] [Neyshabur Tomioka S ICLR 15]

What is the relevant complexity measure (e.g. a norm)? How is it minimized (or controlled) by the optimization algorithm? How does it change if we change the optimization algorithm?

Path-SGD vs. SGD [Neyshabur Salakhutdinov S NIPS 15]: [Figure: cross-entropy training loss, 0/1 training error, and 0/1 test error vs. epoch on MNIST, CIFAR-10, SVHN, and CIFAR-100 (with dropout), comparing Path-SGD and SGD.]

SGD vs. Adam [Wilson Roelofs Stern S Recht, The Marginal Value of Adaptive Gradient Methods in Machine Learning, NIPS 17]: [Figure: training and test error (perplexity) on Penn Treebank using a 3-layer LSTM.]

The Deep Recurrent Residual Boosting Machine. Joe Flow, DeepFace Labs.
Section 1: Introduction. We suggest a new amazing architecture and loss function that is great for learning. All you have to do to learn is fit the model on your training data.
Section 2: Learning. Contribution: our model. The model class $h_w$ is amazing. Our learning method is: $\arg\min_w \frac{1}{m}\sum_{i=1}^m \mathrm{loss}(h_w(x_i); y_i)$ (*)
Section 3: Optimization. This is how we solve the optimization problem (*): [...]
Section 4: Experiments. It works!

Different optimization algorithm ⇒ different bias in the optimum reached ⇒ different inductive bias ⇒ different learning properties.
Goal: understand optimization algorithms not just as reaching some (global) optimum, but as reaching a specific optimum.

Today: precisely understand implicit bias in Matrix Factorization, Linear Classification (Logistic Regression), and Linear Convolutional Networks.

Matrix Reconstruction

$\min_{W \in \mathbb{R}^{n \times n}} F(W) = \|\mathcal{A}(W) - y\|_2^2$, where $\mathcal{A}(W)_i = \langle A_i, W \rangle$, $A_1, \dots, A_m \in \mathbb{R}^{n \times n}$, $y \in \mathbb{R}^m$

- Matrix completion ($A_i$ is an indicator matrix)
- Matrix reconstruction from linear measurements
- Multi-task learning ($A_i = e_{\mathrm{task}(i)}\, \phi(\mathrm{example}_i)^\top$)

[Figure: a partially observed matrix $y$ and measurement matrices $A_1, A_2, \dots$]

We are interested in the regime $m \ll n^2$:
- Many global optima for which $\mathcal{A}(W) = y$.
- Easy to have $\mathcal{A}(W) = y$ without reconstruction/generalization, e.g. for matrix completion, set all unobserved entries to 0.
- Gradient descent on $W$ will generally yield a trivial, non-generalizing solution.

Factorized Matrix Reconstruction

Write $W = UV$ and optimize over the factors:
$\min_{U, V \in \mathbb{R}^{n \times n}} f(U, V) = F(UV) = \|\mathcal{A}(UV) - y\|_2^2$

- Since $U, V$ are full dimensional, there is no constraint on $W$, so this is equivalent to $\min_W F(W)$.
- Underdetermined: all the same global minima, trivial to minimize without generalizing.
- What happens when we optimize by gradient descent on $U, V$?

Gradient descent on $f(U, V)$ gets to good global minima.

Gradient descent on $f(U, V)$ gets to good global minima. Gradient descent on $f(U, V)$ generalizes better with smaller step size.

Question: which global minimum does gradient descent reach? Why does it generalize well?

Gradient descent on $f(U, V)$ converges to the minimum nuclear norm solution.

Conjecture: With stepsize → 0 (i.e. gradient flow) and initialization → 0, gradient descent on $U$ (for the symmetric factorization $W = UU^\top$) converges to the minimum nuclear norm solution:
$UU^\top \to \arg\min_{W \succeq 0} \|W\|_{\mathrm{tr}}$ s.t. $\mathcal{A}(W) = y$
[Gunasekar Woodworth Bhojanapalli Neyshabur S 2017]
- Rigorous proof when the $A_i$'s commute.
- General $A_i$: empirical validation + hand waving.
- Yuanzhi Li, Hongyang Zhang and Tengyu Ma: proved when $y = \mathcal{A}(W^*)$ with $W^*$ low rank and $\mathcal{A}$ satisfying RIP.
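
The same phenomenon is easy to reproduce numerically for the asymmetric factorization $W = UV$ from the earlier slides. A minimal sketch (purely illustrative; the dimensions, number of observations, initialization scale, stepsize, and iteration count are arbitrary choices, not those of the paper): gradient descent on $f(U, V)$ with small initialization tends to find an interpolant with small nuclear norm and small error on the unobserved entries, unlike the trivial zero-filled interpolant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 30, 2, 300                       # 30x30 matrix, rank 2, 300 observed entries (m << n^2)

# Ground-truth low-rank matrix and a random set of observed entries (matrix completion).
W_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = np.zeros(n * n, dtype=bool)
mask[rng.choice(n * n, size=m, replace=False)] = True
mask = mask.reshape(n, n)
y = W_star[mask]

def nuclear_norm(W):
    return np.linalg.svd(W, compute_uv=False).sum()

# Gradient descent on f(U, V) = ||A(UV) - y||^2 with small initialization and small stepsize.
alpha, eta, T = 1e-3, 5e-3, 50_000
U = alpha * rng.standard_normal((n, n))
V = alpha * rng.standard_normal((n, n))
for _ in range(T):
    R = np.zeros((n, n))
    R[mask] = (U @ V)[mask] - y            # residual on the observed entries only
    U, V = U - eta * (R @ V.T), V - eta * (U.T @ R)

# Trivial interpolant: keep the observed entries and fill the rest with zeros.
W_zero = np.zeros((n, n))
W_zero[mask] = y

for name, W in [("factored GD", U @ V), ("zero-fill  ", W_zero)]:
    print(name, "| nuclear norm:", round(nuclear_norm(W), 1),
          "| error on unobserved entries:", round(float(np.linalg.norm((W - W_star)[~mask])), 1))
```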

Implicit Bias in Least Squares: $\min_w \|Aw - b\|_2^2$
- Gradient descent (+ momentum) on $w$ → $\min_{Aw=b} \|w\|_2$
- Gradient descent on the factorization $W = UV$ → probably $\min_{\mathcal{A}(W)=b} \|W\|_{\mathrm{tr}}$, with stepsize → 0 and init → 0, but only in the limit; depends on stepsize and init; proved only in special cases
- AdaGrad on $w$ → in some special cases a minimum norm solution $\min_{Aw=b} \|w\|$, but not always, and it depends on stepsize, adaptation parameters, momentum
- Steepest descent w.r.t. $\|w\|$ → ??? Not $\min_{Aw=b} \|w\|$, even as stepsize → 0! And it depends on stepsize, init, momentum
- Coordinate descent (steepest descent w.r.t. $\|w\|_1$) → related to, but not quite, the Lasso (with stepsize → 0 and a particular tie-breaking rule → LARS)
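
The first case, plain gradient descent on $w$, is easy to verify numerically: started at zero it never leaves the row space of $A$, so it converges to the minimum Euclidean norm interpolant $A^+ b$. A minimal sketch (sizes, stepsize, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 50                              # underdetermined: many w satisfy Aw = b
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Gradient descent on ||Aw - b||^2 started at w = 0 only ever moves within the row
# space of A, so it converges to the minimum Euclidean norm interpolant A^+ b.
w = np.zeros(n)
eta = 1.0 / np.linalg.norm(A, 2) ** 2      # conservative stepsize
for _ in range(20_000):
    w -= eta * A.T @ (A @ w - b)

w_min_norm = np.linalg.pinv(A) @ b
print(np.linalg.norm(w - w_min_norm))      # ~0: GD found the minimum L2 norm solution
```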

Training a Single Unit on Separable Data

$\arg\min_{w \in \mathbb{R}^n} L(w) = \sum_{i=1}^m \ell(y_i \langle w, x_i \rangle)$, with the logistic loss $\ell(z) = \log(1 + e^{-z})$.

Data $\{(x_i, y_i)\}_{i=1}^m$ linearly separable ($\exists w\ \forall i:\ y_i \langle w, x_i \rangle > 0$).

Where does gradient descent $w(t+1) = w(t) - \eta \nabla L(w(t))$ converge?
- $\inf_w L(w) = 0$, but the minimum is unattainable; GD diverges to infinity: $\|w(t)\| \to \infty$, $L(w(t)) \to 0$.
- In what direction? What does $\frac{w(t)}{\|w(t)\|_2}$ converge to?

Theorem: $\frac{w(t)}{\|w(t)\|_2} \to \frac{w^*}{\|w^*\|_2}$, where $w^* = \arg\min \|w\|_2$ s.t. $\forall i:\ y_i \langle w, x_i \rangle \ge 1$ (the hard-margin SVM solution).
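
A quick numerical check of this theorem (a minimal sketch, not the experiment from the talk): gradient descent on the logistic loss over a synthetic separable data set, compared against a hard-margin SVM direction computed with scikit-learn (assumed available). The data generation, stepsize, and iteration count are arbitrary illustrative choices; convergence in direction is slow, so the match is approximate.

```python
import numpy as np
from sklearn.svm import SVC                # reference hard-margin solution (assumes scikit-learn)

rng = np.random.default_rng(2)
n, d = 100, 5
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_true)
X += 0.5 * y[:, None] * w_true / np.linalg.norm(w_true)   # push classes apart: separable with a margin

# Gradient descent on L(w) = (1/n) sum_i log(1 + exp(-y_i <w, x_i>)).
# The loss goes to 0 and ||w|| diverges, but the direction w/||w|| converges
# (slowly, logarithmically in t) to the hard-margin SVM direction.
w = np.zeros(d)
eta = 0.1
for t in range(200_000):
    margins = y * (X @ w)
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
    w -= eta * grad

svm = SVC(kernel="linear", C=1e6).fit(X, y)                # large C approximates the hard margin
w_svm = svm.coef_.ravel()

cos = np.dot(w, w_svm) / (np.linalg.norm(w) * np.linalg.norm(w_svm))
print(cos)                                                  # cosine similarity, close to 1
```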

Other Objectives and Optimization Methods
- Single linear unit, logistic loss → hard-margin SVM solution (regardless of init, stepsize)
- Multi-class problems with softmax loss → multiclass SVM solution (regardless of init, stepsize)
- Steepest descent w.r.t. $\|w\|$ → $\arg\min \|w\|$ s.t. $\forall i:\ y_i \langle w, x_i \rangle \ge 1$ (regardless of init, stepsize)
- Coordinate descent → $\arg\min \|w\|_1$ s.t. $\forall i:\ y_i \langle w, x_i \rangle \ge 1$ (regardless of init, stepsize)
- Matrix factorization problems, $L(U, V) = \sum_i \ell(\langle A_i, UV \rangle)$, including 1-bit matrix completion → $\arg\min \|W\|_{\mathrm{tr}}$ s.t. $\forall i:\ \langle A_i, W \rangle \ge 1$ (regardless of init)

Linear Neural Networks

Graph $G(V, E)$ with $h_v = \sum_{u \to v} w_{u \to v} h_u$ (linear activations). Input units $h_{\mathrm{in}} = x_i \in \mathbb{R}^n$, a single output $h_{\mathrm{out}}(x_i)$, binary labels $y_i \in \{\pm 1\}$.

Training: $\min_w \sum_{i=1}^m \ell(y_i h_{\mathrm{out}}(x_i))$.

The network implements a linear predictor, $h_{\mathrm{out}}(x_i) = \langle P(w), x_i \rangle$, so training is $\min_w L(P(w)) = \sum_{i=1}^m \ell(y_i \langle P(w), x_i \rangle)$.

This is just a different parametrization of linear classification, $\min_{\beta \in \mathrm{Im}(P)} L(\beta)$ (and $\mathrm{Im}(P) = \mathbb{R}^n$ in all our examples). GD on $w$ is a different optimization procedure for the same argmin problem.

Limit of GD: $\bar\beta = \lim_{t \to \infty} \frac{P(w(t))}{\|P(w(t))\|}$.

Fully Connected Linear NNs

$L$ fully connected layers, with $D_l$ units in layer $l$: $h_l \in \mathbb{R}^{D_l}$, $h_0 = h_{\mathrm{in}}$, $h_l = W_l^\top h_{l-1}$, $h_{\mathrm{out}} = h_L$. Parameters: $w = (W_l \in \mathbb{R}^{D_{l-1} \times D_l},\ l = 1..L)$.

Theorem: $\bar\beta \propto \arg\min \|\beta\|_2$ s.t. $\forall i:\ y_i \langle \beta, x_i \rangle \ge 1$,
for $\ell(z) = \exp(-z)$, almost all linearly separable data sets and initializations $w(0)$, and any bounded stepsizes such that $L(w(t)) \to 0$ and $\Delta w(t) = w(t) - w(t-1)$ converges in direction.
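
In other words, adding fully connected linear layers does not change the implicit bias: the induced predictor still converges in direction to the minimum $\ell_2$ norm (maximum margin) solution. A minimal sketch with a two-layer linear network trained by gradient descent on the exponential loss, compared against a single linear unit trained the same way (widths, stepsize, iteration counts, and the synthetic separable data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 100, 5, 16                       # k hidden units in the single (linear) hidden layer
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_true)
X += 0.5 * y[:, None] * w_true / np.linalg.norm(w_true)   # separable with a margin

# Two-layer fully connected *linear* network: h_out(x) = <W1 @ W2, x>, trained by GD on
# the exponential loss. The induced predictor beta = W1 @ W2 should converge in direction
# to the same minimum L2 norm (max margin) solution as a single linear unit.
W1 = 0.1 * rng.standard_normal((d, k))
W2 = 0.1 * rng.standard_normal(k)
eta = 0.01
for t in range(100_000):
    beta = W1 @ W2
    g_beta = -(X.T @ (y * np.exp(-y * (X @ beta)))) / n    # gradient of the exp loss w.r.t. beta
    W1, W2 = W1 - eta * np.outer(g_beta, W2), W2 - eta * (W1.T @ g_beta)
beta = W1 @ W2

# For comparison: a single linear unit trained the same way on the same data.
b1 = np.zeros(d)
for t in range(100_000):
    b1 -= eta * (-(X.T @ (y * np.exp(-y * (X @ b1)))) / n)

print(np.min(y * (X @ beta)) / np.linalg.norm(beta))       # normalized L2 margin of the deep predictor
print(np.dot(beta, b1) / (np.linalg.norm(beta) * np.linalg.norm(b1)))  # cosine similarity, close to 1
```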

Linear Conv Nets

$L - 1$ hidden layers $h_l \in \mathbb{R}^D$, each computed by a full-width cyclic convolution: $h_l[d] = \sum_{k=0}^{D-1} w_l[k]\, h_{l-1}[(d + k) \bmod D]$. Parameters: $w = (w_l \in \mathbb{R}^D,\ l = 1..L)$; output $h_{\mathrm{out}} = \langle w_L, h_{L-1} \rangle$.

Theorem: With a single conv layer (L = 2), $\bar\beta \propto \arg\min \|F\beta\|_1$ s.t. $\forall i:\ y_i \langle \beta, x_i \rangle \ge 1$ (F = discrete Fourier transform).

Theorem: For general $L$, $\bar\beta$ is proportional to a critical point of $\min \|F\beta\|_{2/L}$ s.t. $\forall i:\ y_i \langle \beta, x_i \rangle \ge 1$,
for $\ell(z) = \exp(-z)$, almost all linearly separable data sets and initializations $w(0)$, and any bounded stepsizes such that $L(w(t)) \to 0$ and $\Delta w(t)$ converges in direction.
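
One way to see why the bias moves to the Fourier domain: the end-to-end predictor of such a network is a circular convolution of the layer filters, which becomes an elementwise (diagonal) product after a DFT. A small numpy check of this structure for L = 2 (the dimension and random weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
D = 8
w1, w2, x = rng.standard_normal((3, D))

# Forward pass as on the slide (L = 2): h1[d] = sum_k w1[k] * x[(d + k) mod D], out = <w2, h1>.
h1 = np.array([sum(w1[k] * x[(d + k) % D] for k in range(D)) for d in range(D)])
out_network = w2 @ h1

# The network is linear in x: its effective predictor beta is the circular convolution of
# w2 and w1, so F(beta) = F(w1) * F(w2) elementwise. This diagonal structure in the Fourier
# domain is what turns the implicit bias into an L1-type penalty on the Fourier coefficients.
beta = np.real(np.fft.ifft(np.fft.fft(w1) * np.fft.fft(w2)))
print(np.allclose(out_network, beta @ x))                               # same prediction
print(np.allclose(np.fft.fft(beta), np.fft.fft(w1) * np.fft.fft(w2)))   # product structure
```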

[Figure: resulting predictors under the different implicit biases, $\min \|\beta\|_2$ and $\min \|F\beta\|_{2/L}$ (each s.t. $\forall i:\ y_i \langle \beta, x_i \rangle \ge 1$), for L = 2 and L = 5.]

Goal: understand optimization algorithms not just as reaching some (global) optimum, but as reaching a specific optimum.
Different optimization algorithm ⇒ different bias in the optimum reached ⇒ different inductive bias ⇒ different learning properties.