Deep Feedforward Networks
Lecture slides for Chapter 6 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last updated 2016-10-04
Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation
XOR is not linearly separable
Figure 6.1, left: the original x space (axes x_1 and x_2); no straight line can separate the points where the XOR target is 1 from those where it is 0.
Rectified Linear Activation
g(z) = \max\{0, z\}
Figure 6.3
Network Diagrams
Figure 6.2: An example of a feedforward network, drawn in two different styles. Left: every unit drawn explicitly (x_1, x_2, h_1, h_2, y). Right: one node per layer (x, h, y), with edges annotated by the weight matrices W and w.
Solving XOR
f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^\top \max\{0, \mathbf{W}^\top \mathbf{x} + \mathbf{c}\} + b \quad (6.3)
\mathbf{W} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \quad (6.4) \qquad
\mathbf{c} = \begin{bmatrix} 0 \\ -1 \end{bmatrix} \quad (6.5) \qquad
\mathbf{w} = \begin{bmatrix} 1 \\ -2 \end{bmatrix} \quad (6.6)
with b = 0.
Solving XOR
Figure 6.1: the original x space (left) and the learned h space (right); in the learned space the two points requiring output 1 collapse to a single point, and the classes become linearly separable.
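As a quick sanity check, here is a minimal NumPy sketch (not part of the original slides) that evaluates equation (6.3) with the weights above, using b = 0 as in the book's example:

```python
import numpy as np

# Parameters from equations (6.4)-(6.6), with b = 0 as in the book.
W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.

def f(x):
    """Equation (6.3): f(x; W, c, w, b) = w^T max{0, W^T x + c} + b."""
    h = np.maximum(0., W.T @ x + c)  # rectified hidden layer
    return w @ h + b

for x in [(0., 0.), (0., 1.), (1., 0.), (1., 1.)]:
    print(x, f(np.array(x)))  # XOR: 0, 1, 1, 0
```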
Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation
Gradient-Based Learning
Specify a model and a cost function
Design the model and cost so the cost is smooth
Minimize the cost using gradient descent or related techniques
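To make the recipe concrete, a toy sketch of gradient descent on an assumed one-parameter cost J(θ) = (θ − 3)²; the learning rate and iteration count are illustrative choices:

```python
def J(theta):
    """Toy smooth cost with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def grad_J(theta):
    """Analytic gradient of J."""
    return 2.0 * (theta - 3.0)

theta = 0.0   # arbitrary initialization
lr = 0.1      # learning rate
for _ in range(100):
    theta -= lr * grad_J(theta)  # gradient descent step

print(theta, J(theta))  # theta close to 3, cost close to 0
```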
Conditional Distributions and Cross-Entropy
J(\theta) = -\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\mathbf{y} \mid \mathbf{x}) \quad (6.12)
Output Types

Output Type | Output Distribution | Output Layer                 | Cost Function
Binary      | Bernoulli           | Sigmoid                      | Binary cross-entropy
Discrete    | Multinoulli         | Softmax                      | Discrete cross-entropy
Continuous  | Gaussian            | Linear                       | Gaussian cross-entropy (MSE)
Continuous  | Mixture of Gaussian | Mixture Density              | Cross-entropy
Continuous  | Arbitrary           | See Part III: GAN, VAE, FVBN | Various
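For the first row of the table, a sketch of the sigmoid/binary cross-entropy pairing, written in the numerically stable logits form (the function name is mine, not the book's):

```python
import numpy as np

def binary_cross_entropy_from_logits(z, y):
    """-log Bernoulli likelihood for a sigmoid output, computed stably.

    Uses -log(sigmoid(z)) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z),
    which combine to softplus((1 - 2y) * z) for targets y in {0, 1}.
    """
    return np.logaddexp(0.0, (1.0 - 2.0 * y) * z)  # softplus((1 - 2y) z)

z = np.array([5.0, -5.0, 0.0])   # logits (pre-sigmoid outputs)
y = np.array([1.0, 0.0, 1.0])    # targets
print(binary_cross_entropy_from_logits(z, y))  # small, small, log 2
```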
Mixture Density Outputs
Figure 6.4: samples y drawn from a network with a mixture density output layer, plotted as a function of the input x.
Don't mix and match
Sigmoid output with a target of 1: the plot shows \sigma(z), the cross-entropy loss, and the MSE loss for z \in [-3, 3]. The MSE loss flattens wherever the sigmoid saturates, killing the gradient, while the cross-entropy loss does not.
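A small numerical illustration of why: with a target of 1, the gradient of the cross-entropy loss with respect to z stays large where the sigmoid saturates near 0, while the MSE gradient vanishes. The gradients below are analytic; the z values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])
s = sigmoid(z)
y = 1.0  # target

grad_ce = s - y                            # d/dz of cross-entropy through sigmoid
grad_mse = 2.0 * (s - y) * s * (1.0 - s)   # d/dz of (sigmoid(z) - y)^2

print(grad_ce)   # approx [-1.0, -0.5, ~0]: strong signal even at z = -10
print(grad_mse)  # approx [-9e-5, -0.25, ~0]: almost no signal at z = -10
```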
Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation
Hidden units
Use ReLUs, 90% of the time
For RNNs, see Chapter 10
For some research projects, get creative
Many hidden units perform comparably to ReLUs; new hidden units that perform comparably are rarely interesting.
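For reference, sketches of the ReLU and two comparable alternatives the chapter discusses (the leak slope 0.01 is a common illustrative default, not a prescription):

```python
import numpy as np

def relu(z):
    """g(z) = max{0, z} -- the default recommendation."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """ReLU with a small slope alpha for z < 0, so the gradient never dies."""
    return np.where(z > 0.0, z, alpha * z)

def absolute_value(z):
    """Absolute value rectification: g(z) = |z|."""
    return np.abs(z)

z = np.linspace(-2.0, 2.0, 5)
print(relu(z), leaky_relu(z), absolute_value(z), sep="\n")
```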
Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation
Architecture Basics
Diagram: a network with inputs x_1, x_2, hidden units h_1, h_2, and output y. Depth is the number of layers; width is the number of units per layer.
Universal Approximation Theorem
One hidden layer is enough to represent (not learn) an approximation of any function to an arbitrary degree of accuracy.
So why go deeper?
A shallow net may need (exponentially) more width
A shallow net may overfit more
Exponential Representation Advantage of Depth Figure 6.5
Better Generalization with Greater Depth
Figure 6.6: test accuracy (percent, roughly 92 to 96.5) as a function of network depth (3 to 11 layers); accuracy increases with depth.
Large, Shallow Models Overfit More
Figure 6.7: test accuracy (percent, 91 to 97) vs. number of parameters (0 to 1.0 \times 10^8) for 3-layer convolutional, 3-layer fully connected, and 11-layer convolutional models; the deep model generalizes better than shallow models even when the shallow ones have far more parameters.
Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation
Back-Propagation
Back-propagation is just the chain rule of calculus:
\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx} \quad (6.44)
But it's a particular implementation of the chain rule:
Uses dynamic programming (table filling)
Avoids recomputing repeated subexpressions
Speed vs. memory tradeoff
In vector form, for \mathbf{x} \in \mathbb{R}^m and \mathbf{y} \in \mathbb{R}^n:
\nabla_{\mathbf{x}} z = \left( \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \right)^\top \nabla_{\mathbf{y}} z \quad (6.46)
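A NumPy sketch of equation (6.46) on an assumed toy case, y = Ax and z = Σ y_i²: the vector-Jacobian product (∂y/∂x)ᵀ ∇_y z recovers the same gradient that direct differentiation gives.

```python
import numpy as np

A = np.array([[1., 0., 2.],
              [0., 3., 1.]])   # Jacobian dy/dx of y = A x  (n = 2, m = 3)
x = np.array([1., 2., 3.])

y = A @ x                      # forward: y in R^2
grad_y_z = 2.0 * y             # for z = sum(y**2), the gradient wrt y

grad_x_z = A.T @ grad_y_z      # equation (6.46): (dy/dx)^T grad_y z

print(grad_x_z)                # matches direct differentiation:
print(2.0 * A.T @ (A @ x))     # grad of z = ||A x||^2 is 2 A^T A x
```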
Simple Back-Prop Example
Diagram: forward prop runs from the inputs (x_1, x_2) through the hidden units (h_1, h_2) to y, computing activations and then the loss; back-prop runs in the reverse direction, computing derivatives.
Computation Graphs
Figure 6.8: examples of computational graphs. (a) Multiplication: z = x \times y. (b) Logistic regression: \hat{y} = \sigma(\mathbf{x}^\top \mathbf{w} + b). (c) A ReLU layer: \mathbf{H} = \max\{0, \mathbf{X}\mathbf{W} + \mathbf{b}\}. (d) Linear regression with weight decay.
Repeated Subexpressions
Figure 6.9: a chain w \to x \to y \to z with x = f(w), y = f(x), z = f(y).
\frac{\partial z}{\partial w} \quad (6.50)
= \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \frac{\partial x}{\partial w} \quad (6.51)
= f'(y)\, f'(x)\, f'(w) \quad (6.52)
= f'(f(f(w)))\, f'(f(w))\, f'(w) \quad (6.53)
Back-prop avoids computing these repeated subexpressions twice.
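A sketch of this chain with an assumed f = tanh: the forward pass computes and stores x and y once, and the backward pass of equation (6.52) reuses them instead of recomputing f(w) and f(f(w)) as in equation (6.53).

```python
import numpy as np

f = np.tanh
fprime = lambda t: 1.0 - np.tanh(t) ** 2  # derivative of tanh

w = 0.5

# Forward pass: compute and store each intermediate exactly once.
x = f(w)
y = f(x)
z = f(y)

# Backward pass: equation (6.52), reusing the stored x and y.
dz_dw = fprime(y) * fprime(x) * fprime(w)
print(z, dz_dw)
```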
Symbol-to-Symbol Differentiation
Figure 6.10: the derivative is constructed as additional nodes in the same computational graph. Starting from the chain w \to x \to y \to z, back-prop appends f' nodes and multiplications that compute dz/dy, dz/dx, and dz/dw.
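SymPy makes the symbol-to-symbol idea tangible: differentiating an expression yields another expression (a new graph), which can itself be evaluated or differentiated again. A small sketch, reusing the chain from Figure 6.9 with an assumed f = tanh:

```python
import sympy as sp

w = sp.symbols('w')
z = sp.tanh(sp.tanh(sp.tanh(w)))  # the chain z = f(f(f(w)))

dz_dw = sp.diff(z, w)     # symbol-to-symbol: the derivative is a new expression
print(dz_dw)              # product of three (1 - tanh(.)**2) factors, as in (6.53)

second = sp.diff(dz_dw, w)  # higher-order derivatives come for free
```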
Neural Network Loss Function
Figure 6.11: the computational graph for the cost used to train our example single-layer MLP: a cross-entropy term (J_{MLE}) plus weight decay penalties built from the sums of squared elements of W^{(1)} and W^{(2)}.
Hessian-vector Products
H\mathbf{v} = \nabla_{\mathbf{x}} \left[ \left( \nabla_{\mathbf{x}} f(\mathbf{x}) \right)^\top \mathbf{v} \right] \quad (6.59)
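A quick numerical check of (6.59) on an assumed quadratic f(x) = ½ xᵀAx, whose Hessian is exactly A: the directional finite difference of the gradient along v approximates ∇_x[(∇_x f(x))ᵀ v] = Hv.

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])       # symmetric, so the Hessian of f is A

def grad_f(x):
    """Gradient of f(x) = 0.5 * x^T A x."""
    return A @ x

x = np.array([1., -1.])
v = np.array([0.5, 2.0])
eps = 1e-6

# Hv as a directional derivative of the gradient (equation (6.59)):
Hv = (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2.0 * eps)

print(Hv)      # matches the exact Hessian-vector product:
print(A @ v)
```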
Questions