CS 453X: Class 20. Jacob Whitehill
1 CS 453X: Class 20. Jacob Whitehill
2 More on training neural networks
3 Training neural networks While training neural networks by hand is (arguably) fun, it is completely impractical except for toy examples. How can we find good values for the weights and bias terms automatically? Gradient descent. [Diagram: a 3-layer NN with inputs x1, …, xm, hidden units z1, …, zk, outputs ŷ1, …, ŷl, weights W(1), W(2), and biases b(1), b(2).]
5 Gradient descent: XOR problem Here is how we can conduct gradient descent for the XOR problem. Let's first define: W(1) = [w1 w2; w3 w4], W(2) = [w5 w6], b(1) = [b1; b2], b(2) = b3. Then we can define g so that: ŷ = g(x) = W(2) σ(W(1) x + b(1)) + b(2) = [w5 w6] σ([w1 w2; w3 w4] [x1; x2] + [b1; b2]) + b3 = w5 σ(w1 x1 + w2 x2 + b1) + w6 σ(w3 x1 + w4 x2 + b2) + b3
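The forward computation above can be sketched in a few lines of NumPy. This is a minimal sketch, not code from the lecture; it assumes σ is the ReLU (as used later in these slides), and the particular weight values below are one hypothetical setting that happens to solve XOR.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def forward(x, W1, b1, W2, b2):
    """Compute yhat = W2 @ relu(W1 @ x + b1) + b2 for a 2-2-1 network."""
    z = W1 @ x + b1        # pre-activations of the hidden layer
    h = relu(z)            # hidden activations
    return W2 @ h + b2     # scalar output yhat

# Hypothetical weights that solve XOR with ReLU hidden units:
W1 = np.array([[1., 1.], [1., 1.]])
b1 = np.array([0., -1.])
W2 = np.array([1., -2.])
b2 = 0.

for x1 in (0, 1):
    for x2 in (0, 1):
        yhat = forward(np.array([float(x1), float(x2)]), W1, b1, W2, b2)
        print(x1, x2, yhat)   # yhat is 0, 1, 1, 0 on the four XOR inputs
```

With these weights, both hidden units compute x1 + x2, the second one shifted down by 1, and the output layer subtracts twice the second unit from the first, which reproduces XOR exactly.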
8 Gradient descent: XOR problem From ŷ, we can compute the fMSE cost as: fMSE(ŷ; w1, w2, w3, w4, w5, w6, b1, b2, b3) = (1/2)(ŷ − y)². We then calculate the derivative of fMSE w.r.t. each parameter p using the chain rule as: ∂fMSE/∂p = (∂fMSE/∂ŷ)(∂ŷ/∂p), where: ∂fMSE/∂ŷ = (ŷ − y)
10 Gradient descent: XOR problem Now we just have to differentiate ŷ = g(x) w.r.t. each parameter p, e.g.: ∂ŷ/∂w1 = ∂/∂w1 [w5 σ(w1 x1 + w2 x2 + b1) + w6 σ(w3 x1 + w4 x2 + b2) + b3] = w5 σ′(w1 x1 + w2 x2 + b1) x1. The first term is the only one that depends on w1. It has the form c σ(z), where z is a function of w1. Recall that ∂/∂w1 [c σ(z)] = c σ′(z) ∂z/∂w1. For ReLU: σ′(z) = 0 if z ≤ 0, and 1 if z > 0.
16 Gradient descent: XOR problem Hence: ∂ŷ/∂w1 = 0 if w1 x1 + w2 x2 + b1 ≤ 0, and w5 x1 if w1 x1 + w2 x2 + b1 > 0, since σ′(z) = 0 if z ≤ 0 and 1 if z > 0 for ReLU.
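A piecewise expression like this is easy to sanity-check numerically. The sketch below (not from the lecture; the parameter values are arbitrary) compares the hand-derived ∂ŷ/∂w1 against a central finite-difference approximation, assuming ReLU for σ:

```python
def relu(z):
    return max(z, 0.0)

def yhat(w1, w2, w3, w4, w5, w6, b1, b2, b3, x1, x2):
    return (w5 * relu(w1*x1 + w2*x2 + b1)
            + w6 * relu(w3*x1 + w4*x2 + b2) + b3)

def dyhat_dw1(w1, w2, w3, w4, w5, w6, b1, b2, b3, x1, x2):
    # w5 * sigma'(z1) * x1, where sigma'(z) = 1 if z > 0 else 0 for ReLU
    z1 = w1*x1 + w2*x2 + b1
    return w5 * (1.0 if z1 > 0 else 0.0) * x1

# Arbitrary illustrative parameter values (w1..w6, b1..b3) and input:
params = (0.3, -0.2, 0.5, 0.1, 0.7, -0.4, 0.05, 0.02, 0.1)
x1, x2 = 1.0, 1.0

eps = 1e-6
numeric = (yhat(params[0] + eps, *params[1:], x1, x2)
           - yhat(params[0] - eps, *params[1:], x1, x2)) / (2 * eps)
analytic = dyhat_dw1(*params, x1, x2)
print(abs(numeric - analytic))   # difference should be tiny
```

The finite-difference check breaks down exactly at the ReLU kink (z1 = 0); away from it, the two values agree to many decimal places.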
17 Exercise ∂ŷ/∂b2 = ∂/∂b2 [w5 σ(w1 x1 + w2 x2 + b1) + w6 σ(w3 x1 + w4 x2 + b2) + b3] = w6 σ′(w3 x1 + w4 x2 + b2) = 0 if w3 x1 + w4 x2 + b2 ≤ 0, and w6 if w3 x1 + w4 x2 + b2 > 0.
19 Gradient descent: XOR problem We also need to derive the other partial derivatives ∂ŷ/∂p. Based on these, we can compute all nine gradient terms ∂fMSE/∂p = (ŷ − y) ∂ŷ/∂p, for p ∈ {w1, …, w6, b1, b2, b3}.
20 Gradient descent: XOR problem Given the gradients, we can conduct gradient descent: For each epoch: For each minibatch: Compute all 9 partial derivatives (summed over all examples in the minibatch). Update w1: w1 ← w1 − ε ∂fMSE/∂w1. Update w6: w6 ← w6 − ε ∂fMSE/∂w6. Update b1: b1 ← b1 − ε ∂fMSE/∂b1. Update b3: b3 ← b3 − ε ∂fMSE/∂b3. (And similarly for the remaining parameters.)
21 Gradient descent: (any 3-layer NN) It is more compact and efficient to represent and compute these gradients more abstractly: ∇W(1) fMSE, ∇W(2) fMSE, ∇b(1) fMSE, ∇b(2) fMSE, where ∇W(1) fMSE is the matrix whose (i, j)th entry is ∂fMSE/∂W(1)ij, etc.
22 Gradient descent (any 3-layer NN) Given the gradients, we can conduct gradient descent: For each epoch: For each minibatch: Compute all gradient terms (summed over all examples in the minibatch). Update W(1): W(1) ← W(1) − ε ∇W(1) fMSE. Update W(2): W(2) ← W(2) − ε ∇W(2) fMSE. Update b(1): b(1) ← b(1) − ε ∇b(1) fMSE. Update b(2): b(2) ← b(2) − ε ∇b(2) fMSE.
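The vectorized updates above can be sketched in NumPy. This is an illustrative sketch, not the lecture's code: it assumes ReLU hidden units and the (1/2)(ŷ − y)² cost from the earlier slides, uses the four XOR examples as one minibatch, and checks the analytic gradient of W(1) against finite differences before taking a single gradient-descent step.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR minibatch: columns are examples
X = np.array([[0., 0., 1., 1.],
              [0., 1., 0., 1.]])
y = np.array([0., 1., 1., 0.])

# Randomly initialized parameters of a 2-2-1 network
W1 = rng.normal(size=(2, 2)); b1 = rng.normal(size=2)
W2 = rng.normal(size=2);      b2 = 0.5
lr = 0.1

def fmse(W1, b1, W2, b2):
    H = np.maximum(W1 @ X + b1[:, None], 0)   # hidden activations (ReLU)
    yhat = W2 @ H + b2
    return 0.5 * np.mean((yhat - y) ** 2)

# Analytic gradients, averaged over the minibatch
Z = W1 @ X + b1[:, None]
H = np.maximum(Z, 0)
err = (W2 @ H + b2 - y) / len(y)      # dfMSE/dyhat for each example
gW2 = H @ err
gb2 = err.sum()
dZ = np.outer(W2, err) * (Z > 0)      # backprop through the ReLU
gW1 = dZ @ X.T
gb1 = dZ.sum(axis=1)

# Check gW1 against central finite differences
eps = 1e-6
num = np.zeros_like(W1)
for i in range(2):
    for j in range(2):
        Wp = W1.copy(); Wp[i, j] += eps
        Wm = W1.copy(); Wm[i, j] -= eps
        num[i, j] = (fmse(Wp, b1, W2, b2) - fmse(Wm, b1, W2, b2)) / (2 * eps)
diff = np.abs(num - gW1).max()
print(diff)   # should be tiny

# One gradient-descent update
W1 = W1 - lr * gW1; b1 = b1 - lr * gb1
W2 = W2 - lr * gW2; b2 = b2 - lr * gb2
```

The epoch/minibatch loops from the slide would simply wrap the gradient computation and the four update lines.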
23 Deriving the gradient terms In the XOR problem, we derived each element of each gradient term by hand. For large networks with many layers, this would be prohibitively tedious. Fortunately, we can largely automate the process of computing each gradient term ∇W(1) fMSE, ∇W(2) fMSE, …, ∇W(d) fMSE (for d layers). This procedure is enabled by the chain rule of multivariate calculus.
24 Jacobian matrices & the chain rule of multivariate calculus
25 Jacobian matrices Consider a function f that takes a vector as input and produces a vector as output, e.g.: f([x1; x2]) = [2x1 + x2; x1/x2]. What is the set of f's partial derivatives? ∂f1/∂x1 = 2, ∂f1/∂x2 = 1, ∂f2/∂x1 = 1/x2, ∂f2/∂x2 = −x1/x2². Arranged in a matrix, these form the Jacobian matrix: [2, 1; 1/x2, −x1/x2²].
28 Jacobian matrices For any function f: Rm → Rn, we can define the Jacobian matrix of all partial derivatives: ∂f/∂x is the n × m matrix whose (i, j)th entry is ∂fi/∂xj. Columns correspond to the inputs to f; rows correspond to the outputs of f.
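A Jacobian can also be estimated numerically, which is a handy check on hand derivations. The sketch below (my own illustration, not from the slides) estimates the Jacobian of a small example function f(x) = (2x1 + x2, x1/x2) by central differences and compares it to the analytic matrix:

```python
import numpy as np

def f(x):
    # Example vector-valued function f: R^2 -> R^2 (assumed for illustration)
    return np.array([2*x[0] + x[1], x[0] / x[1]])

def jacobian_fd(fn, x, eps=1e-6):
    """Numerically estimate the Jacobian: rows are outputs, columns are inputs."""
    n = len(fn(x))
    J = np.zeros((n, len(x)))
    for j in range(len(x)):
        d = np.zeros_like(x); d[j] = eps
        J[:, j] = (fn(x + d) - fn(x - d)) / (2 * eps)
    return J

x = np.array([1.0, 2.0])
analytic = np.array([[2.0,    1.0],
                     [1/x[1], -x[0]/x[1]**2]])
print(np.round(jacobian_fd(f, x), 6))
```

Note the row/column convention: each row of the estimated matrix is one output of f, each column one input, matching the definition above.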
29 Chain rule of multivariate calculus Suppose f: Rm → Rn and g: Rk → Rm. Then the Jacobian of the composition f(g(x)) is the matrix product ∂f(g(x))/∂x = (∂f/∂g)(∂g/∂x), where ∂f/∂g is the n × m Jacobian of f and ∂g/∂x is the m × k Jacobian of g. Note that the order matters!
32 Chain rule of multivariate calculus: example Suppose f: R3 → R and g: R2 → R3, and we want the Jacobian of f(g(x)). Two alternative methods to derive it: 1. Substitute g into f and differentiate the resulting expression directly. 2. Apply the chain rule: multiply the (1 × 3) Jacobian of f, evaluated at g(x), by the (3 × 2) Jacobian of g, to obtain the (1 × 2) Jacobian of the composition. Both methods give the same answer.
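The equivalence of the two methods is easy to verify numerically for any concrete choice of f and g. The sketch below uses a hypothetical pair f: R3 → R and g: R2 → R3 (chosen for illustration, not the example from the slides) and checks that the Jacobian of the composition equals the product of the two Jacobians:

```python
import numpy as np

def jacobian_fd(fn, x, eps=1e-6):
    """Central-difference Jacobian: rows are outputs, columns are inputs."""
    n = len(fn(x))
    J = np.zeros((n, len(x)))
    for j in range(len(x)):
        d = np.zeros_like(x); d[j] = eps
        J[:, j] = (fn(x + d) - fn(x - d)) / (2 * eps)
    return J

# Hypothetical g: R^2 -> R^3 and f: R^3 -> R^1, so f(g(x)): R^2 -> R^1
def g(x):
    return np.array([x[0]**2, x[0]*x[1], np.sin(x[1])])

def f(a):
    return np.array([a[0] + 2*a[1] + 3*a[2]])

x = np.array([0.7, -1.2])
lhs = jacobian_fd(lambda t: f(g(t)), x)            # method 1: differentiate f(g(x))
rhs = jacobian_fd(f, g(x)) @ jacobian_fd(g, x)     # method 2: (1x3) @ (3x2) = (1x2)
print(np.allclose(lhs, rhs, atol=1e-5))            # prints True
```

Note that the Jacobian of f must be evaluated at g(x), not at x, and that it multiplies the Jacobian of g from the left, as the slide emphasizes.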
43 f: Rm → R: Jacobian versus gradient Note, for a real-valued function f: Rm → R, the Jacobian matrix is the transpose of the gradient: the gradient ∇x f(x) = [∂f/∂x1; …; ∂f/∂xm] is an m × 1 column vector, while the Jacobian [∂f/∂x1, …, ∂f/∂xm] is a 1 × m row vector.
44 Jacobian matrices Note that we sometimes examine the same function from different perspectives, e.g.: f(x; w) = f(x, y; a, b) = 2ax + b/y. We can consider x, y to be the variables and a, b to be constants, in which case ∂f/∂(x, y) = [2a, −b/y²]. Alternatively, we can consider a, b to be the variables and x, y to be constants, in which case ∂f/∂(a, b) = [2x, 1/y].
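The two perspectives can be checked numerically. The sketch below differentiates f(x, y; a, b) = 2ax + b/y by central differences, once with respect to the inputs (x, y) and once with respect to the parameters (a, b), at arbitrary illustrative values:

```python
def f(x, y, a, b):
    return 2*a*x + b/y

x, y, a, b = 1.5, 2.0, 0.5, 3.0
eps = 1e-6

# Jacobian w.r.t. the inputs (x, y), holding (a, b) fixed:
d_x = (f(x+eps, y, a, b) - f(x-eps, y, a, b)) / (2*eps)   # should be 2a
d_y = (f(x, y+eps, a, b) - f(x, y-eps, a, b)) / (2*eps)   # should be -b/y^2

# Jacobian w.r.t. the parameters (a, b), holding (x, y) fixed:
d_a = (f(x, y, a+eps, b) - f(x, y, a-eps, b)) / (2*eps)   # should be 2x
d_b = (f(x, y, a, b+eps) - f(x, y, a, b-eps)) / (2*eps)   # should be 1/y

print(d_x, d_y, d_a, d_b)
```

When training a neural network, it is the second perspective that matters: the inputs are fixed data and the derivatives are taken with respect to the parameters.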
47 Computing the gradients Jacobian matrices and the chain rule provide a recipe for how to compute all the gradient terms efficiently. Consider the 3-layer NN below: From x, W(1), and b(1), we can compute z. From z and σ, we can compute h = σ(z). From h, W(2), and b(2), we can compute ŷ.
50 Computing the gradients This process is known as forward propagation. It produces all the intermediary (z, h) and final (ŷ) network outputs.
51 Computing the gradients Forward propagation can be applied to a NN of arbitrary depth; in the NN below, with hidden layers 1 and 2 (weights W(1), W(2), W(3) and biases b(1), b(2), b(3)), it would produce z(1), h(1), z(2), h(2), and ŷ.
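Forward propagation over arbitrary depth is just a short loop. Below is a minimal sketch (assuming ReLU hidden layers and a linear output layer; the layer sizes are illustrative) that also records the intermediate z's and h's, since these are exactly the quantities the gradient computation will reuse:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def forward(x, Ws, bs):
    """Forward propagation through a network of arbitrary depth.

    Ws, bs: lists of weight matrices / bias vectors, one per layer.
    Returns yhat plus the intermediate z's and h's needed for the gradients.
    """
    zs, hs = [], []
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        z = W @ h + b          # pre-activations of this hidden layer
        h = relu(z)            # activations of this hidden layer
        zs.append(z); hs.append(h)
    yhat = Ws[-1] @ h + bs[-1]  # linear output layer (no nonlinearity here)
    return yhat, zs, hs

rng = np.random.default_rng(1)
sizes = [3, 4, 4, 2]   # m -> k -> q -> l, as in the two-hidden-layer diagram
Ws = [rng.normal(size=(sizes[i+1], sizes[i])) for i in range(3)]
bs = [rng.normal(size=sizes[i+1]) for i in range(3)]

yhat, zs, hs = forward(rng.normal(size=3), Ws, bs)
print(yhat.shape, [z.shape for z in zs])
```

For the two-hidden-layer network of the slide, zs and hs hold z(1), h(1), z(2), h(2) in order.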
52 Computing the gradients Now, let's look at how to compute each gradient term (e.g., ∂fMSE/∂W(2), ∂fMSE/∂W(1)) for the simple NN below. If we derived each term independently, we would repeat much of the same work: the terms for earlier layers contain the same intermediate derivatives as the terms for later layers. This is redundant computation.
54 Computing the gradients Here is how we can compute all these terms efficiently: proceed backward from the output f toward the input, computing each intermediate derivative once and reusing it for every gradient term that depends on it.
61 Computing the gradients For the simple NN below, let's see how we can use the chain rule to compute each gradient term, starting with ∂ŷ/∂W(2). How does ŷ depend on W(2)? ŷ = W(2) h = W(2)1 h1 + W(2)2 h2 ⟹ ∂ŷ/∂W(2) = [h1, h2] = h^T.
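Because ŷ = W(2) h is linear in W(2), each partial ∂ŷ/∂W(2)i is simply hi, so the row vector of partials is h^T. The sketch below (with hypothetical values for h and W(2)) confirms this by finite differences:

```python
import numpy as np

h = np.array([0.3, 1.5])     # hidden activations (hypothetical values)
W2 = np.array([0.8, -0.6])   # output-layer weights (hypothetical values)
eps = 1e-6

# Finite-difference partials of yhat = W2 . h with respect to each W2 entry
grad = np.zeros(2)
for i in range(2):
    d = np.zeros(2); d[i] = eps
    grad[i] = ((W2 + d) @ h - (W2 - d) @ h) / (2 * eps)

print(grad)   # matches h, i.e. the Jacobian dyhat/dW2 is the row vector h^T
```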
More informationIntroduction to Machine Learning Spring 2018 Note Neural Networks
CS 189 Introduction to Machine Learning Spring 2018 Note 14 1 Neural Networks Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. In this class,
More informationMultilayer Neural Networks
Multilayer Neural Networks Introduction Goal: Classify objects by learning nonlinearity There are many problems for which linear discriminants are insufficient for minimum error In previous methods, the
More information. D Matrix Calculus D 1
D Matrix Calculus D 1 Appendix D: MATRIX CALCULUS D 2 In this Appendix we collect some useful formulas of matrix calculus that often appear in finite element derivations D1 THE DERIVATIVES OF VECTOR FUNCTIONS
More informationCSC321 Lecture 6: Backpropagation
CSC321 Lecture 6: Backpropagation Roger Grosse Roger Grosse CSC321 Lecture 6: Backpropagation 1 / 21 Overview We ve seen that multilayer neural networks are powerful. But how can we actually learn them?
More informationVector, Matrix, and Tensor Derivatives
Vector, Matrix, and Tensor Derivatives Erik Learned-Miller The purpose of this document is to help you learn to take derivatives of vectors, matrices, and higher order tensors (arrays with three dimensions
More informationIntroduction to Machine Learning. Recitation 8. w 2, b 2. w 1, b 1. z 0 z 1. The function we want to minimize is the loss over all examples: f =
Introduction to Macine Learning Lecturer: Regev Scweiger Recitation 8 Fall Semester Scribe: Regev Scweiger 8.1 Backpropagation We will develop and review te backpropagation algoritm for neural networks.
More informationSPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks
Topics in Machine Learning-EE 5359 Neural Networks 1 The Perceptron Output: A perceptron is a function that maps D-dimensional vectors to real numbers. For notational convenience, we add a zero-th dimension
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationComputational Graphs, and Backpropagation
Chapter 1 Computational Graphs, and Backpropagation (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction We now describe the backpropagation algorithm for calculation of derivatives
More informationLinear Neural Networks
Chapter 10 Linear Neural Networks In this chapter, we introduce the concept of the linear neural network. 10.1 Introduction and Notation 1. The linear neural cell, or node has the schematic form as shown
More informationCOGS Q250 Fall Homework 7: Learning in Neural Networks Due: 9:00am, Friday 2nd November.
COGS Q250 Fall 2012 Homework 7: Learning in Neural Networks Due: 9:00am, Friday 2nd November. For the first two questions of the homework you will need to understand the learning algorithm using the delta
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationNN V: The generalized delta learning rule
NN V: The generalized delta learning rule We now focus on generalizing the delta learning rule for feedforward layered neural networks. The architecture of the two-layer network considered below is shown
More informationNonlinear equations and optimization
Notes for 2017-03-29 Nonlinear equations and optimization For the next month or so, we will be discussing methods for solving nonlinear systems of equations and multivariate optimization problems. We will
More informationComputational statistics
Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial
More informationClassification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box
ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses
More informationΝεςπο-Ασαυήρ Υπολογιστική Neuro-Fuzzy Computing
Νεςπο-Ασαυήρ Υπολογιστική Neuro-Fuzzy Computing ΗΥ418 Διδάσκων Δημήτριος Κατσαρός @ Τμ. ΗΜΜΥ Πανεπιστήμιο Θεσσαλίαρ Διάλεξη 21η BackProp for CNNs: Do I need to understand it? Why do we have to write the
More information7. The Multivariate Normal Distribution
of 5 7/6/2009 5:56 AM Virtual Laboratories > 5. Special Distributions > 2 3 4 5 6 7 8 9 0 2 3 4 5 7. The Multivariate Normal Distribution The Bivariate Normal Distribution Definition Suppose that U and
More informationMark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.
University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x
More informationCSC321 Lecture 5: Multilayer Perceptrons
CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3
More information4. Multilayer Perceptrons
4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output
More informationChapter 2. Optimization. Gradients, convexity, and ALS
Chapter 2 Optimization Gradients, convexity, and ALS Contents Background Gradient descent Stochastic gradient descent Newton s method Alternating least squares KKT conditions 2 Motivation We can solve
More informationSingle layer NN. Neuron Model
Single layer NN We consider the simple architecture consisting of just one neuron. Generalization to a single layer with more neurons as illustrated below is easy because: M M The output units are independent
More information