Deep Feedforward Networks. Lecture slides for Chapter 6 of Deep Learning Ian Goodfellow Last updated


1 Deep Feedforward Networks Lecture slides for Chapter 6 of Deep Learning Ian Goodfellow Last updated

2 Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation

3 XOR is not linearly separable (Figure 6.1, left: the original x space, with axes x1 and x2)

4 Rectified Linear Activation $g(z) = \max\{0, z\}$ (Figure 6.3: plot of g(z) against z)

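5 Network Diagrams (Figure 6.2: An example of a feedforward network, drawn in two different styles. One style shows the units x1, x2, h1, h2, and y individually; the other shows the vectors x and h and the output y, with edges labeled W and w.)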

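6 Solving XOR
$$f(x; W, c, w, b) = w^\top \max\{0, W^\top x + c\} + b. \tag{6.3}$$
$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \tag{6.4}$$
$$c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \tag{6.5}$$
$$w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \tag{6.6}$$
and b = 0, as in the textbook's solution.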
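
The XOR network of equation 6.3 can be checked by hand or with a few lines of code. The following is a minimal sketch in plain Python, assuming the textbook's parameter values for equations 6.4 to 6.6 (W all ones, c = (0, -1), w = (1, -2), b = 0); the names `relu` and `xor_net` are illustrative and do not come from the slides.

```python
def relu(z):
    """Rectified linear activation g(z) = max{0, z}."""
    return max(0.0, z)

# Assumed parameter values (textbook equations 6.4 to 6.6, with b = 0).
W = [[1.0, 1.0],
     [1.0, 1.0]]
c = [0.0, -1.0]
w = [1.0, -2.0]
b = 0.0

def xor_net(x):
    """Evaluate f(x) = w^T max{0, W^T x + c} + b for a 2-element input x."""
    # Hidden unit j receives (W^T x + c)_j = sum_i W[i][j] * x[i] + c[j].
    h = [relu(sum(W[i][j] * x[i] for i in range(2)) + c[j]) for j in range(2)]
    return sum(w[j] * h[j] for j in range(2)) + b

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_net(x) for x in inputs]
print(outputs)
```

With these values the four inputs map to 0, 1, 1, 0, which is the XOR function. The hidden layer sends both (0, 1) and (1, 0) to the same point h = (1, 0), which is what makes the problem linearly separable in h space.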

7 Solving XOR (Figure: the original x space, with axes x1 and x2, shown next to the learned h space, with axes h1 and h2)

8 Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation

9 Gradient-Based Learning. Specify the model and the cost. Design the model and cost so the cost is smooth. Minimize the cost using gradient descent or related techniques.
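
As an illustration of this recipe, here is a minimal sketch of plain gradient descent on a toy smooth cost. The cost J(θ) = (θ - 3)², the learning rate, and the step count are all assumptions chosen for the example, not values from the slides.

```python
def cost(theta):
    """Toy smooth cost with its minimum at theta = 3 (an assumed example)."""
    return (theta - 3.0) ** 2

def grad(theta):
    """Derivative of the toy cost with respect to theta."""
    return 2.0 * (theta - 3.0)

def gradient_descent(theta0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from theta0."""
    theta = theta0
    for _ in range(steps):
        theta -= learning_rate * grad(theta)
    return theta

theta_star = gradient_descent(theta0=0.0)
print(theta_star, cost(theta_star))
```

Because the cost is smooth, the gradient shrinks steadily as θ approaches the minimum, so the iterates settle near θ = 3.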

10 Conditional Distributions and Cross-Entropy
$$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x). \tag{6.12}$$
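
In practice the expectation over the empirical distribution in equation 6.12 becomes an average over training examples. The sketch below assumes a classifier that outputs unnormalized scores (logits) and turns them into probabilities with a softmax; the function names and the toy inputs are hypothetical.

```python
import math

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_cost(batch_logits, labels):
    """Average of -log p_model(y | x) over the examples in the batch."""
    nll = 0.0
    for logits, y in zip(batch_logits, labels):
        nll -= math.log(softmax(logits)[y])
    return nll / len(labels)

# Two toy examples with three classes each (made-up numbers).
batch_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 2]
J = cross_entropy_cost(batch_logits, labels)
print(J)
```

A model that assigns uniform probability over three classes pays log 3 per example, and the cost falls toward zero as the model puts more probability on the correct label.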

11 Output Types

| Output Type | Output Distribution | Output Layer | Cost Function |
|---|---|---|---|
| Binary | Bernoulli | Sigmoid | Binary cross-entropy |
| Discrete | Multinoulli | Softmax | Discrete cross-entropy |
| Continuous | Gaussian | Linear | Gaussian cross-entropy (MSE) |
| Continuous | Mixture of Gaussian | Mixture Density | Cross-entropy |
| Continuous | Arbitrary | See part III: GAN, VAE, FVBN | Various |

12 Mixture Density Outputs (Figure 6.4: y plotted against x)

13 Don't mix and match. Sigmoid output with target of 1. (Plot of σ(z), the cross-entropy loss, and the MSE loss against z.)
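
One way to see why the pairing matters is to compare the gradient of each loss with respect to z when a sigmoid output is confidently wrong. This is a minimal sketch under that reading of the slide; the test point z = -10 and the function names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_grad(z):
    """d/dz of -log(sigmoid(z)), the cross-entropy loss for target 1."""
    return sigmoid(z) - 1.0

def mse_grad(z):
    """d/dz of (sigmoid(z) - 1)^2, the squared error for target 1."""
    s = sigmoid(z)
    return 2.0 * (s - 1.0) * s * (1.0 - s)

z = -10.0  # the sigmoid outputs nearly 0 although the target is 1
ce_g = cross_entropy_grad(z)
mse_g = mse_grad(z)
print(ce_g, mse_g)
```

At this point the cross-entropy gradient is close to -1, so learning can still correct the mistake, while the MSE gradient is nearly zero because the extra factor σ(z)(1 - σ(z)) vanishes when the sigmoid saturates.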

14 Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation

15 Hidden units. Use ReLUs, 90% of the time. For RNNs, see Chapter 10. For some research projects, get creative. Many hidden units perform comparably to ReLUs. New hidden units that perform comparably are rarely interesting.

16 Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation

17 Architecture Basics (Diagram: a network with inputs x1 and x2, hidden units h1 and h2, and output y, annotated with its depth and width)

18 Universal Approximator Theorem. One hidden layer is enough to represent (not learn) an approximation of any function to an arbitrary degree of accuracy. So why deeper? A shallow net may need (exponentially) more width. A shallow net may overfit more.

19 Exponential Representation Advantage of Depth Figure 6.5

20 Better Generalization with Greater Depth (Figure 6.6: test accuracy in percent plotted against number of layers)

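21 Large, Shallow Models Overfit More (Figure 6.7: test accuracy in percent plotted against number of parameters, on a scale of 10^8, for convolutional and fully connected models; legible legend entries include "3, fully connected" and "11, convolutional")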

22 Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units Architecture Design Back-Propagation

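23 Back-Propagation. Back-propagation is just the chain rule of calculus:
$$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}. \tag{6.44}$$
For vector-valued x and y, with $\frac{\partial y}{\partial x}$ the Jacobian of y with respect to x, this becomes
$$\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^\top \nabla_y z.$$
But it's a particular implementation of the chain rule. It uses dynamic programming (table filling), avoids recomputing repeated subexpressions, and involves a speed vs. memory tradeoff.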

24 Simple Back-Prop Example. Forward prop: compute activations, then compute the loss. Back-prop: compute derivatives. (Diagram: a network with inputs x1 and x2, hidden units h1 and h2, and output y.)
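
The forward and backward passes can be sketched for a network of the same shape as the diagram (two inputs, two ReLU hidden units, one linear output). Everything specific here is an assumption for illustration: the squared-error loss, the parameter values, and the function names. A finite-difference check on one weight is included as a sanity test.

```python
def forward(x, W, c, w, b):
    """Forward prop: return pre-activations z, activations h, and output y."""
    z = [sum(W[i][j] * x[i] for i in range(2)) + c[j] for j in range(2)]
    h = [max(0.0, zj) for zj in z]
    y = sum(w[j] * h[j] for j in range(2)) + b
    return z, h, y

def loss_fn(y, t):
    """Squared-error loss (an assumed choice for this sketch)."""
    return 0.5 * (y - t) ** 2

def backward(x, t, W, c, w, b):
    """Back-prop: apply the chain rule from the loss back to each parameter."""
    z, h, y = forward(x, W, c, w, b)
    dy = y - t                                   # dL/dy
    dw = [dy * h[j] for j in range(2)]           # dL/dw
    db = dy                                      # dL/db
    dh = [dy * w[j] for j in range(2)]           # dL/dh
    dz = [dh[j] if z[j] > 0 else 0.0 for j in range(2)]  # through the ReLU
    dW = [[x[i] * dz[j] for j in range(2)] for i in range(2)]  # dL/dW
    dc = dz                                      # dL/dc
    return dW, dc, dw, db

# Made-up input, target, and parameters.
x, t = [1.0, 2.0], 1.0
W = [[0.5, -0.3], [0.2, 0.4]]
c = [0.1, 0.0]
w = [1.0, -2.0]
b = 0.05
dW, dc, dw, db = backward(x, t, W, c, w, b)

# Finite-difference check of dL/dW[0][0].
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[0][0] += eps
W_minus = [row[:] for row in W]
W_minus[0][0] -= eps
numeric = (loss_fn(forward(x, W_plus, c, w, b)[2], t)
           - loss_fn(forward(x, W_minus, c, w, b)[2], t)) / (2 * eps)
print(dW[0][0], numeric)
```

The backward pass reuses the values z and h stored during the forward pass, which is the table-filling idea in miniature.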

25 Computation Graphs. Figure 6.8: Examples of computational graphs. (a) The graph using the × operation to compute z = xy. (b) The graph for the logistic regression prediction $\hat{y} = \sigma(x^\top w + b)$. Panel titles: (a) Multiplication, (b) Logistic regression, (c) ReLU layer, (d) Linear regression and weight decay.

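26 Repeated Subexpressions. With x = f(w), y = f(x), and z = f(y) (Figure 6.9: the chain w, x, y, z, with f applied at each step):
$$\begin{aligned}
&\frac{\partial z}{\partial w} && (6.50) \\
&= \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \frac{\partial x}{\partial w} && (6.51) \\
&= f'(y) f'(x) f'(w) && (6.52) \\
&= f'(f(f(w))) f'(f(w)) f'(w) && (6.53)
\end{aligned}$$
In equation 6.53 the subexpression f(w) appears twice. Back-prop avoids computing this twice.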
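
The cost of the repeated subexpression can be made concrete by counting evaluations of f. This is a minimal sketch assuming f = sin (so f' = cos), chosen only because its derivative is simple; the counter and function names are illustrative.

```python
import math

calls = {"f": 0}  # counts how many times f is evaluated

def f(u):
    calls["f"] += 1
    return math.sin(u)

def f_prime(u):
    return math.cos(u)

def dz_dw_recompute(w):
    """Equation 6.53 evaluated literally: f(w) is computed twice."""
    return f_prime(f(f(w))) * f_prime(f(w)) * f_prime(w)

def dz_dw_cached(w):
    """Equation 6.52: store x = f(w) and y = f(x), then reuse them."""
    x = f(w)
    y = f(x)
    return f_prime(y) * f_prime(x) * f_prime(w)

w = 0.7
calls["f"] = 0
slow = dz_dw_recompute(w)
calls_recompute = calls["f"]

calls["f"] = 0
fast = dz_dw_cached(w)
calls_cached = calls["f"]
print(slow, fast, calls_recompute, calls_cached)
```

Both versions return the same derivative, but the literal form evaluates f three times and the cached form only twice. Storing x and y costs memory, which is the speed vs. memory tradeoff in a small case.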

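27 Symbol-to-Symbol Differentiation (Figure 6.10: the forward graph w, x, y, z, with f applied at each step, shown next to the same graph extended with f' nodes that compute dx/dw, dy/dx, and dz/dy, and with product nodes that compute dz/dx and dz/dw)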

28 Neural Network Loss Function. Figure 6.11: The computational graph used to compute the cost used to train our example. (Graph nodes include X, y, W(1), W(2), H, and the intermediate values U(1) through u(8), connected by matmul, relu, cross_entropy, sqr, sum, and + operations that produce J_MLE and J.)

29 Hessian-vector Products
$$Hv = \nabla_x \left[ \left( \nabla_x f(x) \right)^\top v \right]. \tag{6.59}$$
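
Equation 6.59 says that Hv is the gradient of the scalar $(\nabla_x f(x))^\top v$, so the full Hessian never has to be formed. The sketch below follows that structure in plain Python, with two loudly stated assumptions: the toy function f(x) = x0² x1 + x1³ is made up for the example, and the outer gradient is taken with central finite differences as a stand-in for the automatic differentiation a real framework would use.

```python
def grad_f(x):
    """Analytic gradient of the toy function f(x) = x0^2 * x1 + x1^3."""
    return [2.0 * x[0] * x[1], x[0] ** 2 + 3.0 * x[1] ** 2]

def hessian_vector_product(x, v, eps=1e-4):
    """Approximate Hv as the gradient of g(x) = grad_f(x)^T v (equation 6.59)."""
    def g(p):
        return sum(gi * vi for gi, vi in zip(grad_f(p), v))

    hv = []
    for i in range(len(x)):
        plus = list(x)
        plus[i] += eps
        minus = list(x)
        minus[i] -= eps
        # Central difference stands in for the outer gradient operator.
        hv.append((g(plus) - g(minus)) / (2.0 * eps))
    return hv

x = [1.0, 2.0]
v = [3.0, -1.0]
hv = hessian_vector_product(x, v)
print(hv)
```

For this function the Hessian at x = (1, 2) is [[4, 2], [2, 12]], so the exact product with v = (3, -1) is (10, -6), which the sketch reproduces up to rounding error.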

30 Questions


CS60010: Deep Learning, Sudeshna Sarkar, Spring 2018, 16 Jan 2018. FFN goal: approximate some unknown ideal function f : X → Y. Ideal classifier: y = f*(x) with x and category y. Feedforward Network: Define parametric