Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville

Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Chapter 6: Deep Feedforward Networks
Benoit Massé, Dionyssos Kounades-Bastian

Linear regression (and classification)

- Input vector $x$
- Output vector $y$
- Parameters: weight $W$ and bias $b$
- Prediction: $y = Wx + b$

[Diagram: inputs $x_1, x_2, x_3$ mapped through $W, b$ to outputs $y_1, y_2$.]
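
The prediction rule above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `linear_predict` and the numeric values of `W`, `b` and `x` are assumptions chosen for the example, not values from the slides.

```python
# Minimal sketch of the linear prediction y = W x + b.
# linear_predict and all numeric values are hypothetical, for illustration only.

def linear_predict(W, b, x):
    """Return y = W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

W = [[1.0, 0.0, 2.0],
     [0.5, -1.0, 0.0]]   # 2 outputs, 3 inputs
b = [0.5, -0.5]
x = [1.0, 2.0, 3.0]

y = linear_predict(W, b, x)
print(y)  # [7.5, -2.0]
```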

Linear regression (and classification)

Advantages
- Easy to use
- Easy to train, low risk of overfitting

Drawbacks
- Some problems are inherently non-linear

Solving XOR

Linear regressor
[Diagram: inputs $x_1, x_2$ mapped through $W, b$ to the output $y$.]

There is no value for $W$ and $b$ such that
$$\forall (x_1, x_2) \in \{0, 1\}^2, \quad W \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + b = \mathrm{xor}(x_1, x_2)$$
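
One way to see why no such $W$ and $b$ exist is to write $W = (w_1 \; w_2)$ and evaluate the four input cases in turn. The scalar names $w_1, w_2$ are introduced here only for this sketch.

```latex
\begin{align*}
(x_1, x_2) = (0, 0): &\quad b = 0 \\
(x_1, x_2) = (1, 0): &\quad w_1 + b = 1 \;\Rightarrow\; w_1 = 1 \\
(x_1, x_2) = (0, 1): &\quad w_2 + b = 1 \;\Rightarrow\; w_2 = 1 \\
(x_1, x_2) = (1, 1): &\quad w_1 + w_2 + b = 2 \neq 0 = \mathrm{xor}(1, 1)
\end{align*}
```

The first three cases fix all three parameters, and the fourth case then contradicts them.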

Solving XOR

What about...?
[Diagram: inputs $x_1, x_2$ mapped through $W, b$ to $u_1, u_2$, then through $V, c$ to the output $y$.]

Strictly equivalent: the composition of two linear operations is still a linear operation.
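
The equivalence can be checked by expanding the two stacked layers. The names $W'$ and $b'$ are introduced here only for this sketch.

```latex
\begin{align*}
y &= V u + c = V (W x + b) + c \\
  &= (V W)\, x + (V b + c) \\
  &= W' x + b', \qquad W' = V W, \quad b' = V b + c
\end{align*}
```

So the two-layer network without a non-linearity is again a single linear regressor.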

Solving XOR

And about...?
[Diagram: inputs $x_1, x_2$ mapped through $W, b$ to $u_1, u_2$, then through $\varphi$ to $h_1, h_2$, then through $V, c$ to the output $y$.]

In which $\varphi(x) = \max\{0, x\}$.

It is possible! With
$$W = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad b = \begin{pmatrix} 0 \\ -1 \end{pmatrix}, \quad V = \begin{pmatrix} 1 & -2 \end{pmatrix} \quad \text{and} \quad c = 0,$$
$$V \varphi(Wx + b) = \mathrm{xor}(x_1, x_2)$$
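
The claim can be checked by enumerating the four inputs. The sketch below assumes the weights $W = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, $b = (0, -1)^\top$, $V = (1 \; -2)$, $c = 0$ (the XOR solution from Chapter 6 of the book); the function names are hypothetical.

```python
# Minimal sketch: verify that V * relu(W x + b) + c reproduces XOR on {0, 1}^2.
# Weights follow the XOR example of the book's Chapter 6; names are illustrative.

def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    W = [[1.0, 1.0], [1.0, 1.0]]
    b = [0.0, -1.0]
    V = [1.0, -2.0]
    c = 0.0
    u = [W[i][0] * x1 + W[i][1] * x2 + b[i] for i in range(2)]  # pre-activations
    h = [relu(u_i) for u_i in u]                                # hidden layer
    return V[0] * h[0] + V[1] * h[1] + c                        # linear output

outputs = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
for inputs, y in sorted(outputs.items()):
    print(inputs, y)
```

For the input $(1, 1)$ the hidden layer is $h = (2, 1)$, so the output is $2 - 2 \cdot 1 = 0$, which is exactly the case a purely linear model cannot handle.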

Neural network with one hidden layer

Compact representation
[Diagram: $x$ mapped through $W, b, \varphi$ to the hidden layer $h$, then through $V, c$ to $y$.]

Neural network: a hidden layer with a non-linearity can represent a broader class of functions.

Universal approximation theorem

Theorem: a neural network with one hidden layer can approximate any continuous function.

More formally, given a continuous function $f : C_n \to \mathbb{R}^m$, where $C_n$ is a compact subset of $\mathbb{R}^n$,
$$\forall \varepsilon > 0, \ \exists f^{\varepsilon}_{NN} : x \mapsto \sum_{i=1}^{K} v_i \, \varphi(w_i^\top x + b_i) + c$$
such that
$$\forall x \in C_n, \quad \| f(x) - f^{\varepsilon}_{NN}(x) \| < \varepsilon$$
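
As an illustration only (not a proof), the sketch below builds a one-hidden-layer ReLU network of the form $\sum_i v_i \varphi(w_i x + b_i) + c$ in one dimension, with $w_i = 1$, that linearly interpolates a target function on $[0, 1]$. The target $\sin(2\pi x)$, the evenly spaced knots and all function names are assumptions made for this example.

```python
import math

# Sketch: a 1-D ReLU network with K hidden units that interpolates f on [0, 1].
# Target function, knot placement and names are illustrative assumptions.

def relu(z):
    return max(0.0, z)

def fit_relu_interpolant(f, K):
    """Return (v, b, c) such that sum_i v[i] * relu(x + b[i]) + c
    matches f at K + 1 evenly spaced knots on [0, 1]."""
    knots = [i / K for i in range(K + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) * K for i in range(K)]
    # Each hidden unit adds the change of slope at its knot.
    v = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, K)]
    b = [-knots[i] for i in range(K)]
    c = f(0.0)
    return v, b, c

def network(v, b, c, x):
    return sum(v_i * relu(x + b_i) for v_i, b_i in zip(v, b)) + c

def max_error(f, K, samples=1001):
    v, b, c = fit_relu_interpolant(f, K)
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - network(v, b, c, x)) for x in grid)

f = lambda x: math.sin(2.0 * math.pi * x)
errors = {K: max_error(f, K) for K in (4, 16, 64)}
print(errors)  # the maximum error shrinks as the hidden layer widens
```

The error decreases as $K$ grows, which matches the statement that a small enough $\varepsilon$ can always be reached, at the cost of a wider hidden layer.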

Problems

Obtaining the network
The universal approximation theorem gives no information about HOW to obtain such a network:
- Size of the hidden layer $h$
- Values of $W$ and $b$

Using the network
Even if we find a way to obtain the network, the size of the hidden layer may be prohibitively large.

Deep neural network

Why deep?
Let's stack $l$ hidden layers one after the other; $l$ is called the depth of the network.
[Diagram: $x$ mapped through $W_1, b_1, \varphi$ to $h_1$, then through $W_2, b_2, \varphi$, ..., $W_l, b_l, \varphi$ to $h_l$, then through $V, c$ to $y$.]

Properties of DNN
- The universal approximation theorem also applies.
- Some functions can be approximated by a DNN with $N$ hidden units, and would require $O(e^N)$ hidden units to be represented by a shallow network.
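
The stacked architecture amounts to repeating "affine map, then $\varphi$" $l$ times before a final linear output. Below is a minimal sketch in plain Python, assuming ReLU for $\varphi$; the function names and the small example weights are hypothetical.

```python
# Sketch of a forward pass through l hidden layers followed by a linear output.
# Names and example weights are illustrative assumptions.

def relu(z):
    return max(0.0, z)

def affine(W, b, x):
    """Return W x + b, with W given as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def forward(hidden_layers, V, c, x):
    """hidden_layers is a list of (W_k, b_k) pairs, one per hidden layer."""
    h = x
    for W_k, b_k in hidden_layers:
        h = [relu(u) for u in affine(W_k, b_k, h)]  # h_k = phi(W_k h_{k-1} + b_k)
    return affine(V, c, h)                          # y = V h_l + c

# Two hidden layers (l = 2): the XOR layer followed by an identity layer.
hidden_layers = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]
V, c = [[1.0, -2.0]], [0.0]

print(forward(hidden_layers, V, c, [1.0, 0.0]))  # [1.0]
print(forward(hidden_layers, V, c, [1.0, 1.0]))  # [0.0]
```

The depth $l$ is simply the length of `hidden_layers`, and the width of each layer is the number of rows of the corresponding `W_k`.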

Summary

Comparison
- Linear classifier
  (-) Limited representational power
  (+) Simple
- Shallow neural network (exactly one hidden layer)
  (+) Unlimited representational power
  (-) Sometimes prohibitively wide
- Deep neural network
  (+) Unlimited representational power
  (+) Relatively small number of hidden units needed

Remaining problem
How to get this DNN?

On the path of getting my own DNN

Hyperparameters
First, we need to define the architecture of the DNN:
- The depth $l$
- The size of the hidden layers $n_1, \ldots, n_l$
- The activation function $\varphi$
- The output unit

Parameters
When the architecture is defined, we need to train the DNN:
- $W_1, b_1, \ldots, W_l, b_l$

Hyperparameters

The depth $l$ and the size of the hidden layers $n_1, \ldots, n_l$
- Strongly depend on the problem to solve

The activation function $\varphi$
- ReLU: $g : x \mapsto \max\{0, x\}$
- Sigmoid: $\sigma : x \mapsto (1 + e^{-x})^{-1}$
- Many others: tanh, RBF, softplus...

The output unit
- Linear output: $E[y] = V h_l + c$
  For regression with a Gaussian distribution $y \sim \mathcal{N}(E[y], I)$
- Sigmoid output: $\hat{y} = \sigma(w h_l + b)$
  For classification with a Bernoulli distribution $P(y = 1 \mid x) = \hat{y}$
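
The activation functions named above are one-liners in code. This is a minimal sketch using only Python's `math` module; it is not numerically hardened (very large negative inputs to `sigmoid` or very large positive inputs to `softplus` would overflow), and the helper `sigmoid_output` is a hypothetical name for the sigmoid output unit.

```python
import math

# Sketch of common activation functions and of the sigmoid output unit.
# Not numerically hardened; names are illustrative assumptions.

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log(1.0 + math.exp(x))

def sigmoid_output(w, h, b):
    """Sigmoid output unit: y_hat = sigma(w . h + b), read as P(y = 1 | x)."""
    return sigmoid(sum(w_i * h_i for w_i, h_i in zip(w, h)) + b)

print(relu(-2.0), relu(3.0))                     # 0.0 3.0
print(sigmoid(0.0))                              # 0.5
print(math.tanh(0.0))                            # 0.0
print(sigmoid_output([1.0, -1.0], [2.0, 2.0], 0.0))  # 0.5
```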

Parameters: Training

Objective
Let's define $\theta = (W_1, b_1, \ldots, W_l, b_l)$.
We suppose we have a set of inputs $X = (x_1, \ldots, x_N)$ and a set of expected outputs $Y = (y_1, \ldots, y_N)$.
The goal is to find a neural network $f_{NN}$ such that
$$\forall i, \quad f_{NN}(x_i, \theta) \approx y_i$$

Parameters: Training

Cost function
To evaluate the error that our current network makes, let's define a cost function $L(X, Y, \theta)$. The goal becomes to find
$$\arg\min_{\theta} L(X, Y, \theta)$$

Loss function
Should represent a combination of the distances between every $y_i$ and the corresponding $f_{NN}(x_i, \theta)$:
- Mean square error (rare)
- Cross-entropy
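
The two losses mentioned above can be sketched as follows for scalar outputs. The cross-entropy shown is the binary form that pairs with a sigmoid output unit; the clipping constant `eps` and the function names are assumptions added for this example.

```python
import math

# Sketch of two cost functions over N (prediction, target) pairs.
# Binary cross-entropy is assumed; eps guards against log(0).

def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def binary_cross_entropy(probabilities, labels, eps=1e-12):
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))   # 2.0
print(binary_cross_entropy([0.5, 0.5], [1, 0]))     # log(2), about 0.693
```

Both functions average over the training pairs, so the value does not grow with $N$; a sum instead of a mean would only rescale the gradient.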

Parameters: Training

Find the minimum
The basic idea consists in computing $\hat{\theta}$ such that $\nabla_{\theta} L(X, Y, \hat{\theta}) = 0$.
This is difficult to solve analytically, e.g. when $\theta$ has millions of degrees of freedom.

Parameters: Training

Gradient descent
Let's use a numerical way to optimize $\theta$, called gradient descent (section 4.3). The idea is that
$$f(\theta - \varepsilon u) \approx f(\theta) - \varepsilon u^\top \nabla f(\theta)$$
So if we take $u = \nabla f(\theta)$, we have $u^\top u > 0$ and then
$$f(\theta - \varepsilon u) \approx f(\theta) - \varepsilon u^\top u < f(\theta)$$
If $f$ is a function to minimize, we have an update rule that improves our estimate.

Parameters: Training

Gradient descent algorithm
1. Have an estimate $\hat{\theta}$ of the parameters
2. Compute $\nabla_{\theta} L(X, Y, \hat{\theta})$
3. Update $\hat{\theta} \leftarrow \hat{\theta} - \varepsilon \nabla_{\theta} L$
4. Repeat steps 2-3 until $\| \nabla_{\theta} L \| < \text{threshold}$

Problem
How to efficiently estimate $\nabla_{\theta} L(X, Y, \hat{\theta})$? With the back-propagation algorithm.
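
The four steps of the algorithm can be sketched on a toy problem. The quadratic cost $L(\theta) = (\theta_1 - 3)^2 + (\theta_2 + 1)^2$, the step size, the threshold and the `max_steps` safety cap are all assumptions chosen for this example; a real network would obtain the gradient from back-propagation instead of a closed form.

```python
# Sketch of gradient descent on a toy quadratic cost with minimum at (3, -1).
# Cost, step size, threshold and max_steps are illustrative assumptions.

def grad_L(theta):
    """Gradient of L(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

def gradient_descent(grad, theta0, epsilon=0.1, threshold=1e-8, max_steps=10_000):
    theta = list(theta0)                        # step 1: initial estimate
    for _ in range(max_steps):
        g = grad(theta)                         # step 2: compute the gradient
        norm = sum(g_i * g_i for g_i in g) ** 0.5
        if norm < threshold:                    # step 4: stopping criterion
            break
        theta = [t - epsilon * g_i for t, g_i in zip(theta, g)]  # step 3: update
    return theta

theta_hat = gradient_descent(grad_L, [0.0, 0.0])
print(theta_hat)  # close to [3.0, -1.0]
```

With $\varepsilon = 0.1$ each step shrinks the distance to the minimum by a factor of $0.8$ on this cost; a step size above $1.0$ would make the iterates diverge here, which is why $\varepsilon$ has to be chosen with care.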

Back-propagation for Parameter Learning

Consider the architecture:
[Diagram: $x$ mapped through $w_1$ and $\varphi$, then through $w_2$ and $\varphi$, to $y$.]

with function $y = \varphi\big(w_2 \varphi(w_1 x)\big)$, some training pairs $T = \{\hat{x}_n, \hat{y}_n\}_{n=1}^{N}$, and an activation function $\varphi()$.

Learn $w_1, w_2$ so that feeding $\hat{x}_n$ results in $\hat{y}_n$.

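Prerequisite: differentiable activation function

For learning to be possible, $\varphi()$ has to be differentiable.
Let $\varphi'(x) = \frac{\partial \varphi(x)}{\partial x}$ denote the derivative of $\varphi(x)$.
For example, when $\varphi(x) = \mathrm{ReLU}(x)$ we have:
$$\varphi'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$
(ReLU is not differentiable at $x = 0$; in practice a value of 0 or 1 is used there.)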

Gradient-based Learning

Minimize the loss function $L(w_1, w_2, T)$. We will learn the weights by iterating:
$$\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}_{\text{updated}} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \gamma \begin{bmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \end{bmatrix} \qquad (1)$$
- $L$ is the loss function (must be differentiable). In detail it is $L(w_1, w_2, T)$, and we want to compute its gradient at the current $w_1, w_2$.
- $\gamma$ is the learning rate (a scalar, typically chosen in advance).

Back-propagation

Calculate intermediate values on all units:
1. $a = w_1 \hat{x}_n$
2. $b = \varphi(w_1 \hat{x}_n)$
3. $c = w_2 \varphi(w_1 \hat{x}_n)$
4. $d = \varphi\big(w_2 \varphi(w_1 \hat{x}_n)\big)$
5. $L(d) = L\big(\varphi(w_2 \varphi(w_1 \hat{x}_n))\big)$

The partial derivatives are:
6. $\frac{\partial L(d)}{\partial d} = L'(d)$
7. $\frac{\partial d}{\partial c} = \varphi'\big(w_2 \varphi(w_1 \hat{x}_n)\big)$
8. $\frac{\partial c}{\partial b} = w_2$
9. $\frac{\partial b}{\partial a} = \varphi'(w_1 \hat{x}_n)$

Calculating the Gradients I

Apply the chain rule:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial d} \frac{\partial d}{\partial c} \frac{\partial c}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial w_1},$$
$$\frac{\partial L(d)}{\partial w_2} = \frac{\partial L(d)}{\partial d} \frac{\partial d}{\partial c} \frac{\partial c}{\partial w_2}.$$

Start the calculation from left to right: we propagate the gradients (partial products) from the last layer towards the input.

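Calculating the Gradients

And because we have $N$ training pairs:
$$\frac{\partial L}{\partial w_1} = \sum_{n=1}^{N} \frac{\partial L(d_n)}{\partial d_n} \frac{\partial d_n}{\partial c_n} \frac{\partial c_n}{\partial b_n} \frac{\partial b_n}{\partial a_n} \frac{\partial a_n}{\partial w_1},$$
$$\frac{\partial L}{\partial w_2} = \sum_{n=1}^{N} \frac{\partial L(d_n)}{\partial d_n} \frac{\partial d_n}{\partial c_n} \frac{\partial c_n}{\partial w_2}.$$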
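
The forward values $a, b, c, d$ and the chain-rule products can be sketched for the two-weight network $y = \varphi(w_2 \varphi(w_1 x))$. The slides leave $\varphi$ and $L$ generic, so this example assumes a sigmoid activation and a squared-error loss $L(d) = (d - \hat{y}_n)^2$; the training pairs, the weights and the function names are hypothetical. It also uses $\partial a / \partial w_1 = \hat{x}_n$ and $\partial c / \partial w_2 = b$, which follow from the definitions of $a$ and $c$.

```python
import math

# Sketch of back-propagation for y = phi(w2 * phi(w1 * x)).
# Assumptions: sigmoid activation, squared-error loss, made-up data and weights.

def phi(z):
    return 1.0 / (1.0 + math.exp(-z))

def phi_prime(z):
    s = phi(z)
    return s * (1.0 - s)

def loss_and_grads(w1, w2, pairs):
    total_loss, grad_w1, grad_w2 = 0.0, 0.0, 0.0
    for x_n, y_n in pairs:
        # Forward pass: intermediate values on all units.
        a = w1 * x_n
        b = phi(a)
        c = w2 * b
        d = phi(c)
        total_loss += (d - y_n) ** 2
        # Backward pass: partial products from the last layer towards the input.
        dL_dd = 2.0 * (d - y_n)
        dL_dc = dL_dd * phi_prime(c)   # dd/dc = phi'(c)
        dL_db = dL_dc * w2             # dc/db = w2
        dL_da = dL_db * phi_prime(a)   # db/da = phi'(a)
        grad_w2 += dL_dc * b           # dc/dw2 = b
        grad_w1 += dL_da * x_n         # da/dw1 = x_n
    return total_loss, grad_w1, grad_w2

pairs = [(0.5, 1.0), (-1.0, 0.0), (2.0, 1.0)]
w1, w2 = 0.3, -0.7
loss, g1, g2 = loss_and_grads(w1, w2, pairs)

# Sanity check against central finite differences.
step = 1e-6
num_g1 = (loss_and_grads(w1 + step, w2, pairs)[0]
          - loss_and_grads(w1 - step, w2, pairs)[0]) / (2.0 * step)
num_g2 = (loss_and_grads(w1, w2 + step, pairs)[0]
          - loss_and_grads(w1, w2 - step, pairs)[0]) / (2.0 * step)
print(g1, num_g1)
print(g2, num_g2)
```

Note that `dL_dc` is computed once and reused for both gradients; sharing these partial products is what makes back-propagation cheaper than differentiating each weight independently.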

Thank you!
