Deep Feedforward Networks
Lecture slides for Chapter 6 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last updated 2016-10-04
Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation
XOR is not linearly separable
Figure 6.1, left: the original x space (axes x_1 and x_2); no straight line can separate the points where the XOR target is 1 from those where it is 0.
Rectified Linear Activation
g(z) = \max\{0, z\}
Figure 6.3
Network Diagrams
Figure 6.2: An example of a feedforward network, drawn in two different styles. Left: every unit drawn explicitly (x_1, x_2, h_1, h_2, y). Right: one node per layer (x, h, y), with edges annotated by the weight matrices W and w.
Solving XOR
f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^\top \max\{0, \mathbf{W}^\top \mathbf{x} + \mathbf{c}\} + b \quad (6.3)
\mathbf{W} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \quad (6.4) \qquad
\mathbf{c} = \begin{bmatrix} 0 \\ -1 \end{bmatrix} \quad (6.5) \qquad
\mathbf{w} = \begin{bmatrix} 1 \\ -2 \end{bmatrix} \quad (6.6)
with b = 0.
Solving XOR
Figure 6.1: the original x space (left) and the learned h space (right); in the learned space the two points requiring output 1 collapse to a single point, and the classes become linearly separable.
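As a quick sanity check, here is a minimal NumPy sketch (not part of the original slides) that evaluates equation (6.3) with the weights above, using b = 0 as in the book's example:

```python
import numpy as np

# Parameters from equations (6.4)-(6.6), with b = 0 as in the book.
W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.

def f(x):
    """Equation (6.3): f(x; W, c, w, b) = w^T max{0, W^T x + c} + b."""
    h = np.maximum(0., W.T @ x + c)  # rectified hidden layer
    return w @ h + b

for x in [(0., 0.), (0., 1.), (1., 0.), (1., 1.)]:
    print(x, f(np.array(x)))  # XOR: 0, 1, 1, 0
```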
Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation
Gradient-Based Learning
Specify a model and a cost function
Design the model and cost so the cost is smooth
Minimize the cost using gradient descent or related techniques
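To make the recipe concrete, a toy sketch of gradient descent on an assumed one-parameter cost J(θ) = (θ − 3)²; the learning rate and iteration count are illustrative choices:

```python
def J(theta):
    """Toy smooth cost with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def grad_J(theta):
    """Analytic gradient of J."""
    return 2.0 * (theta - 3.0)

theta = 0.0   # arbitrary initialization
lr = 0.1      # learning rate
for _ in range(100):
    theta -= lr * grad_J(theta)  # gradient descent step

print(theta, J(theta))  # theta close to 3, cost close to 0
```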
Conditional Distributions and Cross-Entropy
J(\theta) = -\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\mathbf{y} \mid \mathbf{x}) \quad (6.12)
Output Types

Output Type | Output Distribution | Output Layer                 | Cost Function
Binary      | Bernoulli           | Sigmoid                      | Binary cross-entropy
Discrete    | Multinoulli         | Softmax                      | Discrete cross-entropy
Continuous  | Gaussian            | Linear                       | Gaussian cross-entropy (MSE)
Continuous  | Mixture of Gaussian | Mixture Density              | Cross-entropy
Continuous  | Arbitrary           | See Part III: GAN, VAE, FVBN | Various
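For the first row of the table, a sketch of the sigmoid/binary cross-entropy pairing, written in the numerically stable logits form (the function name is mine, not the book's):

```python
import numpy as np

def binary_cross_entropy_from_logits(z, y):
    """-log Bernoulli likelihood for a sigmoid output, computed stably.

    Uses -log(sigmoid(z)) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z),
    which combine to softplus((1 - 2y) * z) for targets y in {0, 1}.
    """
    return np.logaddexp(0.0, (1.0 - 2.0 * y) * z)  # softplus((1 - 2y) z)

z = np.array([5.0, -5.0, 0.0])   # logits (pre-sigmoid outputs)
y = np.array([1.0, 0.0, 1.0])    # targets
print(binary_cross_entropy_from_logits(z, y))  # small, small, log 2
```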
Mixture Density Outputs
Figure 6.4: samples y drawn from a network with a mixture density output layer, plotted as a function of the input x.
Don't mix and match
Sigmoid output with a target of 1: the plot shows \sigma(z), the cross-entropy loss, and the MSE loss for z \in [-3, 3]. The MSE loss flattens wherever the sigmoid saturates, killing the gradient, while the cross-entropy loss does not.
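A small numerical illustration of why: with a target of 1, the gradient of the cross-entropy loss with respect to z stays large where the sigmoid saturates near 0, while the MSE gradient vanishes. The gradients below are analytic; the z values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])
s = sigmoid(z)
y = 1.0  # target

grad_ce = s - y                            # d/dz of cross-entropy through sigmoid
grad_mse = 2.0 * (s - y) * s * (1.0 - s)   # d/dz of (sigmoid(z) - y)^2

print(grad_ce)   # approx [-1.0, -0.5, ~0]: strong signal even at z = -10
print(grad_mse)  # approx [-9e-5, -0.25, ~0]: almost no signal at z = -10
```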
Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation
Hidden units
Use ReLUs, 90% of the time
For RNNs, see Chapter 10
For some research projects, get creative
Many hidden units perform comparably to ReLUs; new hidden units that perform comparably are rarely interesting.
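For reference, sketches of the ReLU and two comparable alternatives the chapter discusses (the leak slope 0.01 is a common illustrative default, not a prescription):

```python
import numpy as np

def relu(z):
    """g(z) = max{0, z} -- the default recommendation."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """ReLU with a small slope alpha for z < 0, so the gradient never dies."""
    return np.where(z > 0.0, z, alpha * z)

def absolute_value(z):
    """Absolute value rectification: g(z) = |z|."""
    return np.abs(z)

z = np.linspace(-2.0, 2.0, 5)
print(relu(z), leaky_relu(z), absolute_value(z), sep="\n")
```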
Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation
Architecture Basics
Diagram: a network with inputs x_1, x_2, hidden units h_1, h_2, and output y. Depth is the number of layers; width is the number of units per layer.
Universal Approximation Theorem
One hidden layer is enough to represent (not learn) an approximation of any function to an arbitrary degree of accuracy.
So why go deeper?
A shallow net may need (exponentially) more width
A shallow net may overfit more
Exponential Representation Advantage of Depth Figure 6.5
Better Generalization with Greater Depth
Figure 6.6: test accuracy (percent, roughly 92 to 96.5) as a function of network depth (3 to 11 layers); accuracy increases with depth.
Large, Shallow Models Overfit More
Figure 6.7: test accuracy (percent, 91 to 97) vs. number of parameters (0 to 1.0 \times 10^8) for 3-layer convolutional, 3-layer fully connected, and 11-layer convolutional models; the deep model generalizes better than shallow models even when the shallow ones have far more parameters.
Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation
Back-Propagation
Back-propagation is just the chain rule of calculus:
\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx} \quad (6.44)
But it's a particular implementation of the chain rule:
Uses dynamic programming (table filling)
Avoids recomputing repeated subexpressions
Speed vs. memory tradeoff
In vector form, for \mathbf{x} \in \mathbb{R}^m and \mathbf{y} \in \mathbb{R}^n:
\nabla_{\mathbf{x}} z = \left( \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \right)^\top \nabla_{\mathbf{y}} z \quad (6.46)
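A NumPy sketch of equation (6.46) on an assumed toy case, y = Ax and z = Σ y_i²: the vector-Jacobian product (∂y/∂x)ᵀ ∇_y z recovers the same gradient that direct differentiation gives.

```python
import numpy as np

A = np.array([[1., 0., 2.],
              [0., 3., 1.]])   # Jacobian dy/dx of y = A x  (n = 2, m = 3)
x = np.array([1., 2., 3.])

y = A @ x                      # forward: y in R^2
grad_y_z = 2.0 * y             # for z = sum(y**2), the gradient wrt y

grad_x_z = A.T @ grad_y_z      # equation (6.46): (dy/dx)^T grad_y z

print(grad_x_z)                # matches direct differentiation:
print(2.0 * A.T @ (A @ x))     # grad of z = ||A x||^2 is 2 A^T A x
```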
Simple Back-Prop Example
Diagram: forward prop runs from the inputs (x_1, x_2) through the hidden units (h_1, h_2) to y, computing activations and then the loss; back-prop runs in the reverse direction, computing derivatives.
Computation Graphs
Figure 6.8: examples of computational graphs. (a) Multiplication: z = x \times y. (b) Logistic regression: \hat{y} = \sigma(\mathbf{x}^\top \mathbf{w} + b). (c) A ReLU layer: \mathbf{H} = \max\{0, \mathbf{X}\mathbf{W} + \mathbf{b}\}. (d) Linear regression with weight decay.
Repeated Subexpressions
Figure 6.9: a chain w \to x \to y \to z with x = f(w), y = f(x), z = f(y).
\frac{\partial z}{\partial w} \quad (6.50)
= \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \frac{\partial x}{\partial w} \quad (6.51)
= f'(y)\, f'(x)\, f'(w) \quad (6.52)
= f'(f(f(w)))\, f'(f(w))\, f'(w) \quad (6.53)
Back-prop avoids computing these repeated subexpressions twice.
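A sketch of this chain with an assumed f = tanh: the forward pass computes and stores x and y once, and the backward pass of equation (6.52) reuses them instead of recomputing f(w) and f(f(w)) as in equation (6.53).

```python
import numpy as np

f = np.tanh
fprime = lambda t: 1.0 - np.tanh(t) ** 2  # derivative of tanh

w = 0.5

# Forward pass: compute and store each intermediate exactly once.
x = f(w)
y = f(x)
z = f(y)

# Backward pass: equation (6.52), reusing the stored x and y.
dz_dw = fprime(y) * fprime(x) * fprime(w)
print(z, dz_dw)
```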
Symbol-to-Symbol Differentiation
Figure 6.10: the derivative is constructed as additional nodes in the same computational graph. Starting from the chain w \to x \to y \to z, back-prop appends f' nodes and multiplications that compute dz/dy, dz/dx, and dz/dw.
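SymPy makes the symbol-to-symbol idea tangible: differentiating an expression yields another expression (a new graph), which can itself be evaluated or differentiated again. A small sketch, reusing the chain from Figure 6.9 with an assumed f = tanh:

```python
import sympy as sp

w = sp.symbols('w')
z = sp.tanh(sp.tanh(sp.tanh(w)))  # the chain z = f(f(f(w)))

dz_dw = sp.diff(z, w)     # symbol-to-symbol: the derivative is a new expression
print(dz_dw)              # product of three (1 - tanh(.)**2) factors, as in (6.53)

second = sp.diff(dz_dw, w)  # higher-order derivatives come for free
```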
Neural Network Loss Function
Figure 6.11: the computational graph for the cost used to train our example single-layer MLP: a cross-entropy term (J_{MLE}) plus weight decay penalties built from the sums of squared elements of W^{(1)} and W^{(2)}.
Hessian-vector Products
H\mathbf{v} = \nabla_{\mathbf{x}} \left[ \left( \nabla_{\mathbf{x}} f(\mathbf{x}) \right)^\top \mathbf{v} \right] \quad (6.59)
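A quick numerical check of (6.59) on an assumed quadratic f(x) = ½ xᵀAx, whose Hessian is exactly A: the directional finite difference of the gradient along v approximates ∇_x[(∇_x f(x))ᵀ v] = Hv.

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])       # symmetric, so the Hessian of f is A

def grad_f(x):
    """Gradient of f(x) = 0.5 * x^T A x."""
    return A @ x

x = np.array([1., -1.])
v = np.array([0.5, 2.0])
eps = 1e-6

# Hv as a directional derivative of the gradient (equation (6.59)):
Hv = (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2.0 * eps)

print(Hv)      # matches the exact Hessian-vector product:
print(A @ v)
```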
Questions