Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University.


Course Overview

Lecture 1: Linear Models
1. Introduction and Applications
2. Linear Models: Least-Squares and Logistic Regression
Lecture 2: Neural Networks
1. Introduction to Nonlinear Models
2. Single Layer Neural Networks
3. Training Algorithms for Single Layer Neural Networks
4. Neural Networks and Residual Neural Networks (ResNets)
Lecture 3: Neural Networks as Differential Equations
1. ResNets as ODEs
2. Residual CNNs and their relation to PDEs

Introduction

Motivation: Nonlinear Models
In general, it is impossible to find a linear separator between the classes.
[Figure: input features vs. transformed features]
Goal/Trick: Embed the points in a higher dimension and/or move the points to make them linearly separable.

Example: Linear Fitting
Assume $C \in \mathbb{R}^{n_c \times n}$, $Y \in \mathbb{R}^{n_f \times n}$, and $n \gg n_f$.
Goal: Find $W \in \mathbb{R}^{n_c \times n_f}$ such that $C = WY$.
If $\mathrm{rank}(Y) < n$, it may not be possible to fit the data. Two options:
1. Regression: Solve $\min_W \|WY - C\|_F^2$; this always has solutions, but the residual might be large.
2. Nonlinear model: Replace $Y$ by $\sigma(KY)$, where $\sigma$ is an element-wise function (aka activation) and $K \in \mathbb{R}^{m \times n_f}$ with $m \gg n_f$.
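The contrast between the two options can be tried in a few lines of MATLAB. This is a minimal sketch on synthetic data; the dimensions, the tanh activation, and the random targets are illustrative choices, not taken from the lecture.

% Minimal sketch (synthetic data): linear regression vs. a nonlinear model.
rng(0);
nf = 2; n = 500; nc = 3; m = 50;            % features, examples, classes, hidden width
Y  = randn(nf, n);                          % input features
C  = randn(nc, n);                          % targets (illustrative only)

% Option 1: linear regression, min_W ||W*Y - C||_F^2
W1   = (C*Y') / (Y*Y');                     % normal equations: W = C Y' (Y Y')^{-1}
res1 = norm(W1*Y - C, 'fro');

% Option 2: nonlinear model, replace Y by sigma(K*Y) with a random K
K    = randn(m, nf);
Z    = tanh(K*Y);                           % element-wise activation
W2   = (C*Z') / (Z*Z');
res2 = norm(W2*Z - C, 'fro');
fprintf('linear residual %.2e, nonlinear residual %.2e\n', res1, res2);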

Illustrating Nonlinear Models
[Figure: original vs. transformed features]
Remarks:
- instead of $WY = C$, solve $\hat{W}\sigma(KY) = C$
- we solve a bigger problem: memory, computation, ...
- what happens to $\mathrm{rank}(\sigma(KY))$ when $\sigma(x) = x$?

Universal Approximation Theorem
Given the data $Y \in \mathbb{R}^{n_f \times n}$ and $C \in \mathbb{R}^{n_c \times n}$ with $n \gg n_f$:
There is a nonlinear function $\sigma : \mathbb{R} \to \mathbb{R}$, a matrix $K \in \mathbb{R}^{m \times n_f}$, and a bias $b \in \mathbb{R}$ such that
$$\mathrm{rank}(\sigma(KY + b)) = n.$$
Therefore, it is possible [2, 7] to find $W \in \mathbb{R}^{n_c \times m}$ such that
$$W\sigma(KY + b) = C.$$

Choosing the Nonlinear Model
$$W\sigma(KY + b) = C$$
How to choose $\sigma$?
- early days: motivated by neurons
- popular choice: $\sigma(x) = \tanh(x)$ (smooth, bounded, ...)
- nowadays: $\sigma(x) = \max(x, 0)$ (aka ReLU, rectified linear unit; non-differentiable, not bounded, simple)
How to choose $K$ and $b$?
- pick randomly: branded as extreme learning machines [8]
- train (optimize): deep learning (when we have multiple layers)

First Experiment: Random Transformation
Select an activation function, choose $K$ and $b$ randomly, and solve the least-squares/classification problem (see the sketch below).
The pros:
- universal approximation theorem: can interpolate any function
- very easy to program
- can serve as a benchmark for more sophisticated methods
Some concerns:
- may require a very large $K$ (on the order of the size of the data)
- may not generalize well
- large, dense linear algebra
Example code: EELM Peaks.m
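A rough sketch in the spirit of the EELM Peaks.m demo (not the course code): regress the MATLAB peaks function on random tanh features and solve a regularized least-squares problem for $W$. The width m, the ridge parameter, and the sampling are illustrative assumptions.

% Sketch: extreme-learning-machine-style fit of the peaks function.
rng(1);
n  = 2000;  nf = 2;  m = 400;               % examples, input features, random features
Y  = 6*rand(nf, n) - 3;                     % sample points in [-3,3]^2
c  = peaks(Y(1,:), Y(2,:));                 % 1 x n regression targets
K  = randn(m, nf);  b = randn;              % random weights and scalar bias (not trained)
Z  = tanh(K*Y + b);                         % m x n transformed features
lambda = 1e-3;                              % small ridge term for numerical stability
W  = (c*Z') / (Z*Z' + lambda*eye(m));       % regularized least-squares solve for W
fprintf('relative training error: %.2e\n', norm(W*Z - c)/norm(c));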

Single Layer Neural Networks

Learning the Weights
Assume that the number of examples, $n$, is very large. Using random weights, $K$ might need to be very large to fit the training data, and the solution may not generalize well to test data.
Idea: Learn $K$ and $b$ from the data (in addition to $W$):
$$\min_{K, W, b} E(W\sigma(KY + b), C^{\mathrm{obs}}) + \lambda R(W, K, b)$$
About this optimization problem:
- more unknowns: $K \in \mathbb{R}^{m \times n_f}$, $W \in \mathbb{R}^{n_c \times m}$, $b \in \mathbb{R}$
- non-convex problem: local minima, careful initialization
- need to compute derivatives w.r.t. $K$ and $b$ (see the sketch below)
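For concreteness, here is a sketch of the objective and its gradients for a single-layer model, using a least-squares loss instead of the cross-entropy for brevity (the back-propagation structure is analogous). The function name and the choice of tanh are illustrative.

% Sketch: objective and gradients for a single-layer model, least-squares loss.
function [E, dW, dK, db] = singleLayerLS(W, K, b, Y, C)
    Z  = tanh(K*Y + b);          % hidden features, m x n
    R  = W*Z - C;                % residual, nc x n
    E  = 0.5*norm(R, 'fro')^2;   % objective value
    dW = R*Z';                   % gradient w.r.t. W
    dZ = W'*R;                   % back-propagated residual, m x n
    dA = dZ .* (1 - Z.^2);       % chain rule through tanh: sigma'(A) = 1 - tanh(A)^2
    dK = dA*Y';                  % gradient w.r.t. K
    db = sum(dA(:));             % gradient w.r.t. the scalar bias b
end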

Non-Convexity
The optimization problem is non-convex. Simple illustration: plot the cross-entropy along two random directions $dK$ and $dW$ (see ESingleLayer PlotObjective).
Expect things to get worse as the number of layers grows!

Optimization for Single Layer Neural Networks

Gauss-Newton Method
Goal: Use curvature information for fast convergence.
$$\nabla_K E(K, b, W) = (J_K Z)^\top \nabla_Z E(W\sigma(KY + b), C), \quad \text{where } J_K Z = \nabla_K \sigma(KY + b).$$
This means that the Hessian is
$$\nabla^2_K E(K) = (J_K Z)^\top \nabla^2_Z E(C, Z, W)\, J_K Z + \sum_{i=1}^{n}\sum_{j=1}^{m} \nabla^2_K \sigma(KY + b)_{ij}\, \nabla_Z E(C, Z, W)_{ij}.$$
- The first term is SPSD and we can compute it (see the matvec sketch below).
- We neglect the second term since it
  - can be indefinite and difficult to compute,
  - is small if the transformation is roughly linear or close to the solution (easy to see for least-squares).
- Do the same for $b$, use the full Hessian for $W$, and ignore the coupling!
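A matrix-free sketch of the retained (first) term, assuming a least-squares loss $E(K) = \frac{1}{2}\|W\sigma(KY+b) - C\|_F^2$ so that $\nabla^2_Z E = W^\top W$, and $\sigma = \tanh$. This is an illustrative stand-in, not the course implementation; such a matvec is what a Newton-CG-type solver would call.

% Sketch: Gauss-Newton Hessian matvec for the K-block (least-squares loss).
function HdK = gnHessMatVec(dK, W, K, b, Y)
    Sp  = 1 - tanh(K*Y + b).^2;     % sigma'(K*Y + b), elementwise
    JdK = Sp .* (dK*Y);             % J_K Z applied to dK (directional derivative of Z)
    T   = W'*(W*JdK);               % apply nabla_Z^2 E = W'*W (least-squares case)
    HdK = (Sp .* T)*Y';             % adjoint (J_K Z)' maps back to the shape of K
end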

Variable Projection - 1
Idea: Treat the learning problem as a coupled optimization problem with blocks $\theta = (K, b)$ and $W$.
Simple illustration for a coupled least-squares problem [4, 3, 10]:
$$\min_{\theta, w} \phi(\theta, w) = \frac{1}{2}\|A(\theta)w - c\|^2 + \frac{\lambda}{2}\|Lw\|^2 + \frac{\beta}{2}\|M\theta\|^2$$
Note that for a given $\theta$ the problem becomes a standard least-squares problem. Define
$$w(\theta) = \left(A(\theta)^\top A(\theta) + \lambda L^\top L\right)^{-1} A(\theta)^\top c.$$
This gives an optimization problem in $\theta$ only (aka the reduced/projected problem):
$$\min_{\theta} \phi(\theta) = \frac{1}{2}\|A(\theta)w(\theta) - c\|^2 + \frac{\lambda}{2}\|Lw(\theta)\|^2 + \frac{\beta}{2}\|M\theta\|^2$$
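A sketch of the projection step for this least-squares illustration: for a fixed $\theta$, the inner problem is solved in closed form. Here buildA is a hypothetical helper that assembles $A(\theta)$, and $L$ is taken as the identity; both are assumptions for illustration.

% Sketch: projection step w(theta) for the coupled least-squares example.
function w = project(theta, c, lambda)
    A = buildA(theta);                      % hypothetical helper returning A(theta)
    L = eye(size(A, 2));                    % simple Tikhonov regularizer (assumed)
    w = (A'*A + lambda*(L'*L)) \ (A'*c);    % w(theta) = (A'A + lambda L'L)^{-1} A'c
end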

Variable Projection - 2
$$\min_{\theta} \phi(\theta) = \frac{1}{2}\|A(\theta)w(\theta) - c\|^2 + \frac{\lambda}{2}\|Lw(\theta)\|^2 + \frac{\beta}{2}\|M\theta\|^2$$
Optimality condition:
$$\nabla\phi(\theta) = \nabla_\theta \phi(\theta, w) + \nabla_\theta w(\theta)^\top \nabla_w \phi(\theta, w) \overset{!}{=} 0.$$
Less complicated than it seems, since
$$\nabla_w \phi(\theta, w(\theta)) = A(\theta)^\top (A(\theta)w(\theta) - c) + \lambda L^\top L w(\theta) = 0.$$
Conclusions:
- ignore the second term in the gradient computation
- apply steepest descent or Gauss-Newton to minimize $\phi$
- need to solve a least-squares problem in each evaluation of the objective
- the gradient is only correct if the LS problem is solved exactly

Variable Projection for Single Layer
$$\min_{\theta, W} E(W\sigma(K(\theta)Y), C) + \lambda R(\theta, W)$$
Assume that the regularizer is separable, i.e., $R(\theta, W) = R_1(\theta) + R_2(W)$, and that $R_2$ is convex and smooth. Hence, the projection requires solving the regularized classification problem
$$W(\theta) = \arg\min_{W} E(W\sigma(K(\theta)Y), C) + \lambda R_2(W).$$
Practical considerations:
- solve for $W(\theta)$ using Newton (need accuracy)
- need a good solver to approximate the gradient w.r.t. $\theta$ well
- use Gauss-Newton or steepest descent to solve for $\theta$

Stochastic Optimization
Assume that each pair $(y_i, c_i)$ is drawn from some (unknown) probability distribution. Then we can interpret the learning problem as minimizing the expected value of the loss (e.g., the cross-entropy); for linear regression,
$$E(W) = \mathbb{E}\left(\frac{1}{2}\|Wy - c\|^2\right).$$
This is a stochastic optimization problem [1].
One idea, stochastic approximation: design an iteration $W_k \to W$ so that the expected value decreases. Examples: stochastic gradient descent (SGD), ADAM, ...
- Pro: the sample can be small (mini-batch)
- Con: how to monitor the objective, line search, descent, ... (see the SGD sketch below)
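A minimal SGD sketch for the single-layer model, reusing the singleLayerLS sketch from above. It assumes W, K, b, Y, C are already initialized; the learning rate, batch size, and epoch count are illustrative, not recommendations from the lecture.

% Sketch: plain SGD loop over reshuffled mini-batches.
lr = 1e-2;  batch = 32;  nEpochs = 10;
n  = size(Y, 2);
for epoch = 1:nEpochs
    idx = randperm(n);                                % reshuffle every epoch
    for k = 1:floor(n/batch)
        S = idx((k-1)*batch + (1:batch));             % current mini-batch
        [~, dW, dK, db] = singleLayerLS(W, K, b, Y(:,S), C(:,S));
        W = W - lr*dW;  K = K - lr*dK;  b = b - lr*db;
    end
end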

Stochastic Average Approximation
An alternative way to solve the stochastic optimization problem
$$E(W) = \mathbb{E}\left(\frac{1}{2}\|Wy - c\|^2\right):$$
pick a relatively large sample $S \subset \{1, \ldots, n\}$ and use a deterministic optimization method to solve
$$\min_W \frac{1}{2|S|}\sum_{s \in S}\|Wy_s - c_s\|^2.$$
- Pro: use your favorite solver, line search, stopping criteria, ...
- Con: large batches needed
Note: the sample stays fixed during the iteration (see the sketch below).
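A sketch contrasting with SGD: draw one larger sample once and run a deterministic method (plain gradient descent here, for brevity) on that fixed sample. The sample size, step size, and iteration count are illustrative, and singleLayerLS is the sketch from above.

% Sketch: one fixed sample, then a deterministic optimization loop.
nS = min(n, 2048);
S  = randperm(n, nS);                   % sample is drawn once and stays fixed
YS = Y(:,S);  CS = C(:,S);
lr = 1e-2;
for iter = 1:200
    [E, dW, dK, db] = singleLayerLS(W, K, b, YS, CS);   % E can be monitored / used in a line search
    W = W - lr*dW;  K = K - lr*dK;  b = b - lr*db;
end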

Experiment: Peaks
Compare the three approaches for training a single-layer neural network:
- ESingleLayer PeaksSGD.m - stochastic gradient descent
- ESingleLayer PeaksNewtonCG.m - Newton-CG with a block-diagonal Hessian approximation
- ESingleLayer PeaksVarPro.m - fully coupled solver: eliminate $W$ and use steepest descent/Newton-CG for the reduced problem in $\theta$

Deep Neural Networks

Why Deep Networks?
The universal approximation theorem suggests that we can approximate any function with two layers. But:
- the width of the layer can be very large, $O(n\,n_f)$
- deeper architectures can lead to more efficient descriptions of the problem; no real proof, but lots of practical experience

Deep Neural Networks
How deep is deep? We will answer this question later...
Until recently, the standard architecture was
$$Y_1 = \sigma(K_0 Y_0 + b_0), \quad \ldots, \quad Y_N = \sigma(K_{N-1} Y_{N-1} + b_{N-1}),$$
and $Y_N$ is used to classify. This leads to the optimization problem
$$\min_{K_0, \ldots, K_{N-1},\, b_0, \ldots, b_{N-1},\, W} E\left(W Y_N(K_0, \ldots, K_{N-1}, b_0, \ldots, b_{N-1}), C^{\mathrm{obs}}\right).$$
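A sketch of the corresponding forward propagation, with the layer weights stored in cell arrays (the function name and the tanh activation are illustrative assumptions).

% Sketch: forward propagation through N layers, Y_j = sigma(K_{j-1} Y_{j-1} + b_{j-1}).
function YN = forwardDNN(Ks, bs, Y0)
    Y = Y0;
    for j = 1:numel(Ks)
        Y = tanh(Ks{j}*Y + bs{j});      % bs{j} may be a scalar or a column vector
    end
    YN = Y;
end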

Deep Neural Networks in Practice
(Some) challenges:
- computational costs (architectures have millions or billions of parameters)
- difficult to design
- difficult to train (exploding/vanishing gradients)
- unpredictable performance
In 2015, He et al. [5, 6] came up with a new architecture that addresses many of these problems.

Residual Neural Networks

Simplified Residual Neural Network
Residual network:
$$Y_1 = Y_0 + \sigma(K_0 Y_0 + b_0), \quad \ldots, \quad Y_N = Y_{N-1} + \sigma(K_{N-1} Y_{N-1} + b_{N-1}),$$
and $Y_N$ is used to classify. This leads to the optimization problem
$$\min_{K_0, \ldots, K_{N-1},\, b_0, \ldots, b_{N-1},\, W} E\left(W Y_N(K_0, \ldots, K_{N-1}, b_0, \ldots, b_{N-1}), C^{\mathrm{obs}}\right).$$
This leads to a smoother objective function [9].
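A sketch of residual forward propagation; compared with the forwardDNN sketch above, the skip connection is the only change. Note that the identity skip assumes a constant feature dimension, so each Ks{j} must be square here (names and activation are again illustrative).

% Sketch: residual forward propagation, Y_j = Y_{j-1} + sigma(K_{j-1} Y_{j-1} + b_{j-1}).
function YN = forwardResNet(Ks, bs, Y0)
    Y = Y0;
    for j = 1:numel(Ks)
        Y = Y + tanh(Ks{j}*Y + bs{j});  % skip connection: add the layer output to its input
    end
    YN = Y;
end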

References
[1] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv preprint arXiv:1606.04838 [stat.ML], 2016.
[2] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.
[3] G. Golub and V. Pereyra. Separable nonlinear least squares: the variable projection method and its applications. Inverse Problems, 19:R1-R26, 2003.
[4] G. H. Golub and V. Pereyra. The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM Journal on Numerical Analysis, 10(2):413-432, 1973.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630-645. Springer, 2016.
[7] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.
[8] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1-3):489-501, 2006.
[9] H. Li, Z. Xu, G. Taylor, and T. Goldstein. Visualizing the Loss Landscape of Neural Nets. 2017.

References (cont.)
[10] D. P. O'Leary and B. W. Rust. Variable projection for nonlinear least squares problems. Computational Optimization and Applications, 54(3):579-593, 2013.