Numerical Computation for Deep Learning


Numerical Computation for Deep Learning
Lecture slides for Chapter 4 of Deep Learning (www.deeplearningbook.org)
Ian Goodfellow
Last modified 2017-10-14
Thanks to Justin Gilmer and Jacob Buckman for helpful discussions.

Numerical concerns for implementations of deep learning algorithms
Algorithms are often specified in terms of real numbers, but real numbers cannot be implemented in a finite computer.
Does the algorithm still work when implemented with a finite number of bits?
Do small changes in the input to a function cause large changes to the output? Rounding errors, noise, and measurement errors can then cause large changes downstream.
Iterative search for the best input is difficult.

Roadmap
Iterative optimization
Rounding error, underflow, overflow

Iterative Optimization
Gradient descent
Curvature
Constrained optimization

Gradient Descent
[Figure 4.1: an illustration of how the gradient descent algorithm uses the derivatives of a function, here f(x) = ½x² with f′(x) = x. The global minimum is at x = 0; since f′(x) = 0 there, gradient descent halts. For x < 0, we have f′(x) < 0, so we can decrease f by moving rightward; for x > 0, we have f′(x) > 0, so we can decrease f by moving leftward.]

Approximate Optimization
[Figure 4.3: a plot of f(x) with several minima. Ideally, we would like to arrive at the global minimum, but this might not be possible. One local minimum performs nearly as well as the global one, so it is an acceptable halting point; another local minimum performs poorly and should be avoided.]

We usually don't even reach a local minimum
[Figure: gradient norm vs. training time (epochs) on the left, classification error rate vs. training time (epochs) on the right.]

Deep learning optimization
Pure math way of life: find literally the smallest value of f(x), or maybe find some critical point of f(x) where the value is locally smallest.
Deep learning way of life: just decrease the value of f(x) a lot.

Iterative Optimization
Gradient descent
Curvature
Constrained optimization

Critical Points
[Figure 4.2: a minimum, a maximum, and a saddle point.]

Saddle Points
[Figure 4.5: a surface plot of f(x₁, x₂) with a saddle point: positive curvature along one axis, negative along the other.]
Saddle points attract Newton's method. (Gradient descent escapes; see Appendix C of "Qualitatively Characterizing Neural Network Optimization Problems.")

Curvature
[Figure 4.4: three plots of f(x) vs. x showing negative curvature, no curvature, and positive curvature.]

Directional Second Derivatives
[Figure: two panels, "Before multiplication" and "After multiplication", showing how multiplying by a matrix stretches space along its eigenvectors v⁽¹⁾ and v⁽²⁾ by the corresponding eigenvalues.]

Predicting optimal step size using Taylor series

f(x⁽⁰⁾ − εg) ≈ f(x⁽⁰⁾) − ε gᵀg + ½ ε² gᵀHg    (4.9)

There are three terms here: the original value of the function, the expected improvement due to the slope, and the correction due to the curvature. When gᵀHg is positive, the optimal step size is

ε* = gᵀg / (gᵀHg)    (4.10)

The worst case occurs when g aligns with the eigenvector of H corresponding to the maximal eigenvalue. Big gradients speed you up; big eigenvalues slow you down if you align with their eigenvectors.
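
As a concrete illustration (a minimal numpy sketch, my example, not from the slides): for a quadratic f(x) = ½ xᵀHx the gradient is g = Hx, and equation 4.10 gives the step size directly.

import numpy as np

H = np.array([[5., 0.], [0., 1.]])   # Hessian; eigenvalues 5 and 1
x0 = np.array([1., 1.])
g = H @ x0                           # gradient of f(x) = 0.5 * x^T H x at x0
eps_star = (g @ g) / (g @ H @ g)     # optimal step size, eq. 4.10
x1 = x0 - eps_star * g               # one gradient descent step
print(eps_star)                      # ~0.206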

Condition Number

max over i, j of |λᵢ / λⱼ|    (4.2)

When the condition number is large, sometimes you hit large eigenvalues and sometimes you hit small ones. The large ones force you to keep the learning rate small, and you miss out on moving fast in the small-eigenvalue directions.
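
For a symmetric Hessian, the condition number is easy to compute from the eigenvalues (a small numpy sketch, my example):

import numpy as np

H = np.array([[5., 0.], [0., 1.]])
lam = np.linalg.eigvalsh(H)                   # eigenvalues of the symmetric Hessian
print(np.abs(lam).max() / np.abs(lam).min())  # 5.0 -- condition number, eq. 4.2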

Gradient Descent and Poor Conditioning
[Figure 4.6: gradient descent zigzagging down a poorly conditioned quadratic bowl in the (x₁, x₂) plane.]

Neural net visualization
At the end of learning:
- the gradient is still large
- the curvature is huge
(From "Qualitatively Characterizing Neural Network Optimization Problems")

Iterative Optimization
Gradient descent
Curvature
Constrained optimization

KKT Multipliers

min over x, max over λ, max over α ≥ 0 of  f(x) + Σᵢ λᵢ g⁽ⁱ⁾(x) + Σⱼ αⱼ h⁽ʲ⁾(x)    (4.19)

In this book, mostly used for theory (e.g., to show the Gaussian is the highest-entropy distribution).
In practice, we usually just project back to the constraint region after each step.
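
A minimal sketch of that practical recipe (my illustration; the unit L2 ball constraint is an assumption made up for the example):

import numpy as np

def project_to_unit_ball(x):
    # project back onto the feasible region {x : ||x||_2 <= 1}
    norm = np.linalg.norm(x)
    return x if norm <= 1. else x / norm

x = np.array([2., 2.])
grad = np.array([0.5, -0.5])              # gradient of some objective at x (made up)
x = project_to_unit_ball(x - 0.1 * grad)  # gradient step, then projection
print(x)                                  # now inside the unit ball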

Roadmap
Iterative optimization
Rounding error, underflow, overflow

Numerical Precision: A deep learning super skill
Often deep learning algorithms "sort of work": the loss goes down, accuracy gets within a few percentage points of the state of the art, and there are no bugs per se.
Often deep learning algorithms explode (NaNs, large values).
The culprit is often loss of numerical precision.

Rounding and truncation errors
In a digital computer, we use float32 or similar schemes to represent real numbers.
A real number x is rounded to x + δ for some small δ.
Overflow: a very large x is replaced by inf.
Underflow: a very small x is replaced by 0.
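
All three effects are easy to see in numpy (my snippet, not from the slides):

import numpy as np

print(np.float32(1.) + np.float32(1e-8))      # 1.0 -- rounding: the small term is lost
print(np.float32(1e38) * np.float32(10.))     # inf -- overflow
print(np.float32(1e-38) * np.float32(1e-10))  # 0.0 -- underflow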

Example
Adding a very small number to a larger one may have no effect, and this can cause large changes downstream:
>>> a = np.array([0., 1e-8]).astype('float32')
>>> a.argmax()
1
>>> (a + 1).argmax()
0
(In float32, 1 + 1e-8 rounds back to exactly 1, so both entries tie and argmax returns the first index.)

Secondary effects
Suppose we have code that computes x − y.
Suppose x overflows to inf, and y also overflows to inf.
Then x − y = inf − inf = NaN.
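
For example (my snippet):

import numpy as np

x = np.exp(np.float32(1000.))  # overflows to inf
y = np.exp(np.float32(999.))   # also overflows to inf
print(x - y)                   # nan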

exp
exp(x) overflows for large x. x doesn't need to be very large: in float32, exp(89) overflows. Never use large x.
exp(x) underflows for very negative x. Possibly not a problem; possibly catastrophic if exp(x) is a denominator, an argument to a logarithm, etc.
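
Checking these thresholds in numpy (my snippet; the underflow point near −104 is my addition, based on the smallest positive float32 subnormal):

import numpy as np

print(np.exp(np.float32(89.)))    # inf: exp(89) ~ 4.5e38 exceeds the float32 max (~3.4e38)
print(np.exp(np.float32(-104.)))  # 0.0: below the smallest positive float32 (~1.4e-45)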

Subtraction
Suppose x and y have similar magnitude, and x is always greater than y. In a computer, x − y may still come out negative due to rounding error.
Example: variance

Var(f(x)) = E[(f(x) − E[f(x)])²]    (3.12)  — safe
          = E[f(x)²] − E[f(x)]²              — dangerous
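
To see the cancellation (my numpy example, not from the slides): for float32 data with a large mean and a tiny variance, the dangerous formula can even come out negative.

import numpy as np

x = (1000. + 1e-3 * np.random.randn(10000)).astype('float32')

safe = np.mean((x - np.mean(x)) ** 2)          # E[(f(x) - E[f(x)])^2]
dangerous = np.mean(x ** 2) - np.mean(x) ** 2  # E[f(x)^2] - E[f(x)]^2

print(safe)       # ~1e-6, roughly correct
print(dangerous)  # dominated by rounding error, possibly negative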

log and sqrt
log(0) = −inf. log(<negative>) is imaginary, usually nan in software.
sqrt(0) is 0, but its derivative has a divide by zero.
Definitely avoid underflow or round-to-negative in the argument!
Common case: standard_dev = sqrt(variance)

log exp
log(exp(x)) is a common pattern and should be simplified to x.
This avoids overflow in the exp, and underflow in the exp causing −inf in the log.

Which is the better hack?
normalized_x = x / st_dev
eps = 1e-7
Should we use
st_dev = sqrt(eps + variance)
or
st_dev = eps + sqrt(variance)?
What if variance is implemented safely and will never round to negative?
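
One way to compare them (my sketch, not the slides' answer): even if both forward values are safe, the gradients differ. d/dv sqrt(eps + v) = 1/(2 sqrt(eps + v)) is bounded by 1/(2 sqrt(eps)), while d/dv (eps + sqrt(v)) = 1/(2 sqrt(v)) still divides by zero at v = 0.

import numpy as np

eps, v = 1e-7, 0.
print(1. / (2. * np.sqrt(eps + v)))  # ~1581: gradient of sqrt(eps + v) stays bounded
print(1. / (2. * np.sqrt(v)))        # inf: gradient of eps + sqrt(v) divides by zero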

log(sum(exp))
Naive implementation: tf.log(tf.reduce_sum(tf.exp(array)))
Failure modes:
If any entry is very large, the exp overflows.
If all entries are very negative, all the exps underflow and then the log is −inf.

Stable version
mx = tf.reduce_max(array)
safe_array = array - mx
log_sum_exp = mx + tf.log(tf.reduce_sum(tf.exp(safe_array)))
Built-in version: tf.reduce_logsumexp
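
To see the difference concretely, here is the same computation in plain numpy (my example, not from the slides):

import numpy as np

a = np.array([1000., 1000.])
naive = np.log(np.sum(np.exp(a)))           # exp overflows
m = a.max()
stable = m + np.log(np.sum(np.exp(a - m)))  # 1000 + log(2)
print(naive)   # inf
print(stable)  # ~1000.693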

Why does the logsumexp trick work?
Algebraically equivalent to the original version:
m + log Σᵢ exp(aᵢ − m)
= m + log Σᵢ exp(aᵢ) / exp(m)
= m + log ( (1 / exp(m)) Σᵢ exp(aᵢ) )
= m − log exp(m) + log Σᵢ exp(aᵢ)
= log Σᵢ exp(aᵢ)

Why does the logsumexp trick work?
No overflow: the entries of safe_array are at most 0.
Some of the exp terms may underflow, but not all: at least one entry of safe_array is 0, so the sum of the exp terms is at least 1, and the sum is now safe to pass to the log.

Softmax
Use your library's built-in softmax function.
If you build your own, use:
safe_logits = logits - tf.reduce_max(logits)
softmax = tf.nn.softmax(safe_logits)
Similar to logsumexp.

Sigmoid
Use your library's built-in sigmoid function.
If you build your own, recall that sigmoid is just softmax with one of the two logits hard-coded to 0, as sketched below.
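
A minimal numpy sketch of that identity (my illustration, not any library's implementation): treat the logits as [x, 0] and apply the same max-subtraction trick as for softmax.

import numpy as np

def stable_sigmoid(x):
    # sigmoid(x) = softmax([x, 0])[0]; subtract m = max(x, 0) before exponentiating
    m = np.maximum(x, 0.)
    return np.exp(x - m) / (np.exp(x - m) + np.exp(-m))

print(stable_sigmoid(np.array([-1000., 0., 1000.])))  # [0.  0.5 1. ], no overflow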

Cross-entropy
The cross-entropy loss for softmax (and sigmoid) has both a softmax and a logsumexp in it.
Compute it using the logits, not the probabilities: the probabilities lose gradient to rounding error where the softmax saturates.
Use tf.nn.softmax_cross_entropy_with_logits or similar.
If you roll your own, use the stabilization tricks for softmax and logsumexp.
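
If you do roll your own, the two tricks combine into one line. For a single example with integer label y, −log softmax(logits)[y] = logsumexp(logits) − logits[y]; a minimal numpy sketch (my illustration, with made-up names):

import numpy as np

def cross_entropy_from_logits(logits, y):
    # -log softmax(logits)[y] = logsumexp(logits) - logits[y]
    m = logits.max()                                      # stabilization shift
    log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))  # stable logsumexp
    return log_sum_exp - logits[y]

print(cross_entropy_from_logits(np.array([1000., 0., -1000.]), 0))  # 0.0, no overflow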

Bug hunting strategies
If you increase your learning rate and the loss stays stuck, you are probably rounding your gradient to zero somewhere, e.g. by computing cross-entropy using probabilities instead of logits.
For a correctly implemented loss, too high a learning rate should usually cause an explosion.

Bug hunting strategies
If you see an explosion (NaNs, very large values), immediately suspect log, exp, sqrt, and division.
Always suspect the code that changed most recently.

Questions