10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Backpropagation Matt Gormley Lecture 12 Feb 23, 2018 1
Neural Networks Outline Logistic Regression (Recap) Data, Model, Learning, Prediction Neural Networks A Recipe for Machine Learning Visual Notation for Neural Networks Example: Logistic Regression Output Surface 2-Layer Neural Network 3-Layer Neural Network Neural Net Architectures Objective Functions Activation Functions Backpropagation Basic Chain Rule (of calculus) Chain Rule for Arbitrary Computation Graph Backpropagation Algorithm Module-based Automatic Differentiation (Autodiff) Last Lecture This Lecture 2
ARCHITECTURES 3
Neural Network Architectures Even for a basic Neural Network, there are many design decisions to make: 1. # of hidden layers (depth) 2. # of units per hidden layer (width) 3. Type of activation function (nonlinearity) 4. Form of objective function 4
Building a Neural Net Q: How many hidden units, D, should we use? Output Features 5
Building a Neural Net Q: How many hidden units, D, should we use? Output Hidden Layer D = M Input 6
Building a Neural Net Q: How many hidden units, D, should we use? Output Hidden Layer D = M Input 7
Building a Neural Net Q: How many hidden units, D, should we use? Output Hidden Layer D = M Input 8
Building a Neural Net Q: How many hidden units, D, should we use? Output What method(s) is this setting similar to? Hidden Layer D < M Input 9
Building a Neural Net Q: How many hidden units, D, should we use? Output Hidden Layer D > M What method(s) is this setting similar to? Input 10
Decision Functions Deeper Networks Q: How many layers should we use? Output Hidden Layer 1 Input 11
Decision Functions Deeper Networks Q: How many layers should we use? Output Hidden Layer 2 Hidden Layer 1 Input 12
Decision Functions Deeper Networks Q: How many layers should we use? Output Hidden Layer 3 Hidden Layer 2 Hidden Layer 1 Input 13
Decision Functions Deeper Networks Q: How many layers should we use? Theoretical answer: a neural network with 1 hidden layer is a universal function approximator. Cybenko (1989): for any continuous function g(x), there exists a 1-hidden-layer neural net h_θ(x) s.t. |h_θ(x) - g(x)| < ε for all x, assuming sigmoid activation functions. Empirical answer: before 2006, deep networks (e.g. 3 or more hidden layers) were too hard to train; after 2006, deep networks are easier to train than shallow networks (e.g. 2 or fewer hidden layers) for many problems. Big caveat: you need to know and use the right tricks. 14
Decision Functions Different Levels of Abstraction We don't know the right levels of abstraction, so let the model figure it out! Example from Honglak Lee (NIPS 2010) 18
Decision Functions Different Levels of Abstraction Face Recognition: a deep network can build up increasingly higher levels of abstraction: lines, parts, regions. Example from Honglak Lee (NIPS 2010) 19
Decision Functions Different Levels of Abstraction Output Hidden Layer 3 Hidden Layer 2 Hidden Layer 1 Input Example from Honglak Lee (NIPS 2010) 20
Activation Functions Neural Network with sigmoid activation functions: (F) Loss: J = (1/2)(y - y*)^2; (E) Output (sigmoid): y = 1 / (1 + exp(-b)); (D) Output (linear): b = Σ_{j=0}^D β_j z_j; (C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)), ∀j; (B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j; (A) Input: given x_i, ∀i. 21
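A minimal NumPy sketch of the forward pass defined by steps (A) through (F) above. The function and variable names (forward, alpha, beta) and the convention that a bias term x_0 = z_0 = 1 is prepended are illustrative assumptions, not the lecture's own code:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, alpha, beta, y_star):
    """Forward pass for the 1-hidden-layer sigmoid network.

    x:      input vector of length M
    alpha:  hidden-layer weights, shape (D, M+1); column 0 multiplies the bias x_0 = 1
    beta:   output-layer weights, length D+1; entry 0 multiplies the bias z_0 = 1
    y_star: gold output
    """
    x = np.concatenate(([1.0], x))            # (A) input, with bias term x_0 = 1
    a = alpha @ x                             # (B) hidden (linear):  a_j = sum_i alpha_ji x_i
    z = np.concatenate(([1.0], sigmoid(a)))   # (C) hidden (sigmoid), with bias z_0 = 1
    b = beta @ z                              # (D) output (linear):  b = sum_j beta_j z_j
    y = sigmoid(b)                            # (E) output (sigmoid)
    J = 0.5 * (y - y_star) ** 2               # (F) quadratic loss
    return J, y
```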
Activation Functions Neural Network with arbitrary nonlinear activation functions: (F) Loss: J = (1/2)(y - y*)^2; (E) Output (nonlinear): y = σ(b); (D) Output (linear): b = Σ_{j=0}^D β_j z_j; (C) Hidden (nonlinear): z_j = σ(a_j), ∀j; (B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j; (A) Input: given x_i, ∀i. 22
Activation Functions Sigmoid / Logistic Function: logistic(u) = 1 / (1 + e^(-u)). So far, we've assumed that the activation function (nonlinearity) is always the sigmoid function. 23
Activation Functions A new change: modifying the nonlinearity. The logistic is not widely used in modern ANNs. Alternate 1: tanh, like the logistic function but shifted to the range [-1, +1]. Slide from William Cohen
Figure: sigmoid vs. tanh activations in a depth-4 network, from Glorot & Bengio (AISTATS 2010)
Activation Functions A new change: modifying the nonlinearity. ReLU is often used in vision tasks. Alternate 2: rectified linear unit, linear with a cutoff at zero. (Implementation: clip the gradient when you pass zero.) Slide from William Cohen
Activation Functions A new change: modifying the nonlinearity. ReLU is often used in vision tasks. Alternate 2: rectified linear unit. Soft version: log(exp(x)+1). Doesn't saturate (at one end), sparsifies outputs, helps with the vanishing gradient problem. Slide from William Cohen
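For concreteness, here is a small sketch of the activation functions mentioned on these slides, written in NumPy (the function names are illustrative):

```python
import numpy as np

def logistic(u):
    # sigmoid; saturates near 0 and 1
    return 1.0 / (1.0 + np.exp(-u))

def tanh(u):
    # like the logistic, but shifted/scaled to the range [-1, +1]
    return np.tanh(u)

def relu(u):
    # rectified linear unit: linear with a cutoff at zero
    return np.maximum(0.0, u)

def softplus(u):
    # the "soft" version of relu: log(exp(u) + 1)
    return np.log1p(np.exp(u))
```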
Objective Functions for NNs 1. Quadratic Loss: the same objective as Linear Regression, i.e. mean squared error. 2. Cross-Entropy: the same objective as Logistic Regression, i.e. negative log likelihood. This requires probabilities, so we add an additional softmax layer at the end of our network. Forward / Backward: Quadratic: J = (1/2)(y - y*)^2, dJ/dy = y - y*. Cross Entropy: J = y* log(y) + (1 - y*) log(1 - y), dJ/dy = y* (1/y) + (1 - y*) (1/(y - 1)). 28
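A sketch of the forward and backward computations in the table above, keeping the slide's sign conventions (the function names are illustrative; y is the network output and y_star the gold label):

```python
import numpy as np

def quadratic_loss(y, y_star):
    J = 0.5 * (y - y_star) ** 2
    dJ_dy = y - y_star
    return J, dJ_dy

def cross_entropy_loss(y, y_star):
    # same sign convention as the table: J = y* log(y) + (1 - y*) log(1 - y)
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    dJ_dy = y_star * (1.0 / y) + (1 - y_star) * (1.0 / (y - 1))
    return J, dJ_dy
```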
Objective Functions for NNs Cross-entropy vs. Quadratic loss Figure from Glorot & Bengio (2010)
Multi-Class Output Output Hidden Layer Input 30
Multi-Class Output Softmax: y_k = exp(b_k) / Σ_{l=1}^K exp(b_l). (F) Loss: J = Σ_{k=1}^K y*_k log(y_k); (E) Output (softmax): y_k = exp(b_k) / Σ_{l=1}^K exp(b_l); (D) Output (linear): b_k = Σ_{j=0}^D β_{kj} z_j, ∀k; (C) Hidden (nonlinear): z_j = σ(a_j), ∀j; (B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j; (A) Input: given x_i, ∀i. 31
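A minimal sketch of the softmax output layer and the multi-class loss above (subtracting the max before exponentiating is an added numerical-stability detail, not from the slide):

```python
import numpy as np

def softmax(b):
    e = np.exp(b - np.max(b))   # shift by max(b) to avoid overflow; result is unchanged
    return e / np.sum(e)

def multiclass_output(b, y_star):
    """b: length-K vector of linear outputs b_k; y_star: one-hot gold label."""
    y = softmax(b)                      # (E) output (softmax): y_k = exp(b_k) / sum_l exp(b_l)
    J = np.sum(y_star * np.log(y))      # (F) loss, with the slide's sign convention
    return J, y
```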
BACKPROPAGATION 32
Background A Recipe for Machine Learning 1. Given training data. 2. Choose each of these: a decision function, a loss function. 3. Define goal. 4. Train with SGD (take small steps opposite the gradient). 33
Training Approaches to Differentiation Question 1: When can we compute the gradients of the parameters of an arbitrary neural network? Question 2: When can we make the gradient computation efficient? 34
Training Approaches to Differentiation 1. Finite Difference Method. Pro: great for testing implementations of backpropagation. Con: slow for high dimensional inputs / outputs. Required: ability to call the function f(x) on any input x. 2. Symbolic Differentiation. Note: the method you learned in high school; used by Mathematica / Wolfram Alpha / Maple. Pro: yields easily interpretable derivatives. Con: leads to exponential computation time if not carefully implemented. Required: mathematical expression that defines f(x). 3. Automatic Differentiation (Reverse Mode). Note: called backpropagation when applied to neural nets. Pro: computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to the computation of f(x). Con: slow for high dimensional outputs (e.g. vector-valued functions). Required: algorithm for computing f(x). 4. Automatic Differentiation (Forward Mode). Note: easy to implement; uses dual numbers. Pro: computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to the computation of f(x). Con: slow for high dimensional inputs (e.g. a vector-valued x). Required: algorithm for computing f(x). 35
Training Finite Difference Method Notes: suffers from issues of floating point precision in practice; typically only appropriate to use on small examples with an appropriately chosen epsilon. 36
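As an illustration of the finite difference method, here is a centered-difference gradient check of the kind recommended above for testing backpropagation (a sketch; the default step size epsilon is a hypothetical choice):

```python
import numpy as np

def finite_difference_grad(f, x, epsilon=1e-5):
    """Approximate df/dx_i as (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps).

    Only requires the ability to call f on any input x, but costs two calls
    of f per input dimension, which is why it is slow for high-dimensional x
    and is best used on small examples to test a backprop implementation.
    """
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = epsilon
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * epsilon)
    return grad
```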
Training Symbolic Differentiation Chain Rule Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below? 37
Training Symbolic Differentiation Calculus Quiz #2: 39
Training Chain Rule Whiteboard Chain Rule of Calculus 40
Training Chain Rule Given: y = g(u) and u = h(x). Chain Rule: dy_i/dx_k = Σ_{j=1}^J (dy_i/du_j)(du_j/dx_k), ∀i, k 41
Training Chain Rule Given: y = g(u) and u = h(x). Chain Rule: dy_i/dx_k = Σ_{j=1}^J (dy_i/du_j)(du_j/dx_k), ∀i, k. Backpropagation is just repeated application of the chain rule from Calculus 101. 42
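To make "repeated application of the chain rule" concrete, here is a sketch of backpropagation through the sigmoid network of slide 21, reusing the hypothetical conventions of the earlier forward-pass sketch (each line of the backward pass is one application of the chain rule):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backward(x, alpha, beta, y_star):
    """Gradients of J = 0.5*(y - y*)^2 w.r.t. the weights beta and alpha."""
    # forward pass (same conventions as the earlier sketch, with bias terms x_0 = z_0 = 1)
    x = np.concatenate(([1.0], x))
    a = alpha @ x
    z = np.concatenate(([1.0], sigmoid(a)))
    b = beta @ z
    y = sigmoid(b)
    # backward pass: chain rule applied step by step
    dJ_db = (y - y_star) * y * (1.0 - y)        # dJ/dy * dy/db
    dJ_dbeta = dJ_db * z                        # dJ/db * db/dbeta_j
    dJ_dz = dJ_db * beta                        # dJ/db * db/dz_j
    dJ_da = dJ_dz[1:] * z[1:] * (1.0 - z[1:])   # dJ/dz_j * dz_j/da_j (skip the bias z_0)
    dJ_dalpha = np.outer(dJ_da, x)              # dJ/da_j * da_j/dalpha_ji
    return dJ_dbeta, dJ_dalpha
```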