Neural networks COMS 4771

Size: px

Start display at page:

Download "Neural networks COMS 4771"

Timothy Samson McKinney
5 years ago
Views:

1 Neural networks COMS 4771

2 1. Logistic regression

3 Logistic regression Suppose X = R d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a particular form: Y X = x Bern(logistic(x T w)), x R d, with logistic(z) := 1 ez = 1 + e z 1 + e, z R. z Parameters: w = (w 1,..., w d ) R d. Conditional probability function: η w(x) = logistic(x T w). 1 / 22

4 Logistic regression Network diagram for η w: x 1 x 2 w 1 w 2 v. w d x d d v := g(z), z := w jx j, j=1 (g = logistic). Here, g is called the link function. 2 / 22

5 Learning w from data Training data ((x i, y i)) n i=1 from R d {0, 1}. Could use MLE to learn w from data. 3 / 22

6 Learning w from data Training data ((x i, y i)) n i=1 from R d {0, 1}. Could use MLE to learn w from data. Another option: Squared loss ERM (with link function g) 1 min w R d n n (g(x T i w) y i) 2. i=1 3 / 22

7 Learning w from data Training data ((x i, y i)) n i=1 from R d {0, 1}. Could use MLE to learn w from data. Another option: Squared loss ERM (with link function g) 1 min w R d n n (g(x T i w) y i) 2. i=1 Observe that for any (X, Y ) P (not necessarily logistic regression), [ ( ) ] 2 ( 2 E g(x T w) Y X = x = g(x T w) η(x)) + var(y X = x) where η(x) = P(Y = 1 X = x). 3 / 22

8 Learning w from data Training data ((x i, y i)) n i=1 from R d {0, 1}. Could use MLE to learn w from data. Another option: Squared loss ERM (with link function g) 1 min w R d n n (g(x T i w) y i) 2. i=1 Observe that for any (X, Y ) P (not necessarily logistic regression), [ ( ) ] 2 ( 2 E g(x T w) Y X = x = g(x T w) η(x)) + var(y X = x) where η(x) = P(Y = 1 X = x). Algorithm for Squared loss ERM with link function g? 3 / 22

9 Stochastic gradient method w {(g(x T w) y) 2} = 2(g(x T w) y) g (x T w) x. 4 / 22

10 Stochastic gradient method w {(g(x T w) y) 2} = 2(g(x T w) y) g (x T w) x. Stochastic gradient method for squared loss ERM with link function g 1: Start with some initial w (1) R d. 2: for t = 1, 2,... until some stopping condition is satisfied do 3: Pick (X (t), Y (t) ) uniformly at random from (x 1, y 1 ),..., (x n, y n ). 4: Update: w (t+1) := w (t) 2η t (g( X (t), w (t) ) Y (t) ) g ( X (t), w (t) ) X (t). 5: end for 4 / 22

11 Extensions Other loss functions (e.g., y ln (1 y) ln ). p 1 p Other link functions (e.g., g(z) = some polynomial in z). 5 / 22

12 Extensions Other loss functions (e.g., y ln (1 y) ln ). p 1 p Other link functions (e.g., g(z) = some polynomial in z). Is the overall objective function convex? 5 / 22

13 Extensions Other loss functions (e.g., y ln (1 y) ln ). p 1 p Other link functions (e.g., g(z) = some polynomial in z). Is the overall objective function convex? Somtimes, but not always. 5 / 22

14 Extensions Other loss functions (e.g., y ln (1 y) ln ). p 1 p Other link functions (e.g., g(z) = some polynomial in z). Is the overall objective function convex? Somtimes, but not always. Nevertheless, stochastic gradient method is still often effective at finding approximate local minima. 5 / 22

15 2. Multilayer neural networks

16 Two-output network x 1 v 1 x 2. v 2 x d d v j := g(z j), z j := W i,jx i, j {1, 2}. i=1 6 / 22

17 k-output network x 1 v 1 x 2 v 2. x d v k d v j := g(z j), z j := W i,jx i, i=1 j {1,..., k}. 7 / 22

18 A motivating example: multitask learning k binary prediction tasks with a single feature vector (e.g., predicting tags for images). Labeled examples are of the form (x i, (y i,1,..., y i,k )) R d {0, 1} k. 8 / 22

19 A motivating example: multitask learning k binary prediction tasks with a single feature vector (e.g., predicting tags for images). Labeled examples are of the form (x i, (y i,1,..., y i,k )) R d {0, 1} k. Option 1: k independent logistic regression models; learn w 1,..., w k by minimizing (e.g.) 1 n k (g(x T i w j) y i,j) 2. n i=1 j=1 8 / 22

20 A motivating example: multitask learning k binary prediction tasks with a single feature vector (e.g., predicting tags for images). Labeled examples are of the form (x i, (y i,1,..., y i,k )) R d {0, 1} k. Option 1: k independent logistic regression models; learn w 1,..., w k by minimizing (e.g.) 1 n k (g(x T i w j) y i,j) 2. n i=1 j=1 Suppose labels y i,1,..., y i,k are not independent. E.g., if y i,1 = 1, then also more likely that y i,2 = 1. 8 / 22

21 A motivating example: multitask learning k binary prediction tasks with a single feature vector (e.g., predicting tags for images). Labeled examples are of the form (x i, (y i,1,..., y i,k )) R d {0, 1} k. Option 1: k independent logistic regression models; learn w 1,..., w k by minimizing (e.g.) 1 n k (g(x T i w j) y i,j) 2. n i=1 j=1 Suppose labels y i,1,..., y i,k are not independent. E.g., if y i,1 = 1, then also more likely that y i,2 = 1. Option 2: Do Option 1, but also learn to combine predictions of y i,1,..., y i,k to get better predictions for each y i,j. 8 / 22

22 Multilayer neural network x 1 W (1) W (2) v (1) 1 v (2) 1 x 2 v (1) 2 v (2) 2. x d v (1) k v (2) k Columns of W 1 R d k : params. of original logistic regression models. Columns of W 2 R k k : params. of new logistic regression models to combine predictions of original models. 9 / 22

23 Multilayer neural network x 1 W (1) W (2) v (1) 1 v (2) 1 x 2 v (1) 2 v (2) 2. x d v (1) k v (2) k Columns of W 1 R d k : params. of original logistic regression models. Columns of W 2 R k k : params. of new logistic regression models to combine predictions of original models. Each node is called a unit. 9 / 22

24 Multilayer neural network x 1 W (1) W (2) v (1) 1 v (2) 1 x 2 v (1) 2 v (2) 2. x d v (1) k v (2) k Columns of W 1 R d k : params. of original logistic regression models. Columns of W 2 R k k : params. of new logistic regression models to combine predictions of original models. Each node is called a unit. Non-input and non-output units are called hidden. 9 / 22

25 Compositional structure Suppose we have two functions f W 1 : R d R k, (W 2 R d k ), f W 2 : R k R l, (W 2 R k l ), where f W (x) := g(w T x), and g applies the link function g coordinate-wise to a vector. 10 / 22

26 Compositional structure Suppose we have two functions f W 1 : R d R k, (W 2 R d k ), f W 2 : R k R l, (W 2 R k l ), where f W (x) := g(w T x), and g applies the link function g coordinate-wise to a vector. Composition: f W 1,W 2 := f W 2 f W 1 is defined by f W 1,W 2 (x) := f W 2 ( fw 1 (x) ). 10 / 22

27 Compositional structure Suppose we have two functions f W 1 : R d R k, (W 2 R d k ), f W 2 : R k R l, (W 2 R k l ), where f W (x) := g(w T x), and g applies the link function g coordinate-wise to a vector. Composition: f W 1,W 2 := f W 2 f W 1 is defined by This is a two-layer neural network. f W 1,W 2 (x) := f W 2 ( fw 1 (x) ). x 1 W (1) W (2) v (1) 1 v (2) 1 x 2 v (1) 2 v (2) 2. x d v (1) k v (2) l 10 / 22

28 Necessity of multiple layers One-layer neural network with a monotonic link function is a linear (or affine) classifier. Cannot represent XOR function (Minsky and Papert, 1969). x 1 1 x 1 x 1 1 1? x x x 2 (a) x 1 and x 2 (b) x1 or x2 (c) x1 xor x2 (Figure from Stuart Russell.) 11 / 22

29 Approximation power of multilayer neural networks Theorem (Cybenko, 1989; Hornik, 1991; Barron, 1993). Any continuous function f can be approximated arbitrarily well by a two-layer neural network f f W 2 }{{} R k R f W }{{} 1. R d R k However: may need a very large number of hidden units. 12 / 22

30 Approximation power of multilayer neural networks Theorem (Cybenko, 1989; Hornik, 1991; Barron, 1993). Any continuous function f can be approximated arbitrarily well by a two-layer neural network f f W 2 }{{} R k R f W }{{} 1. R d R k However: may need a very large number of hidden units. Theorem (Telgarsky, 2015; Eldan and Shamir, 2015). Some functions can be approximated with exponentially fewer hidden units by using more than two layers. 12 / 22

31 Approximation power of multilayer neural networks Theorem (Cybenko, 1989; Hornik, 1991; Barron, 1993). Any continuous function f can be approximated arbitrarily well by a two-layer neural network f f W 2 }{{} R k R f W }{{} 1. R d R k However: may need a very large number of hidden units. Theorem (Telgarsky, 2015; Eldan and Shamir, 2015). Some functions can be approximated with exponentially fewer hidden units by using more than two layers. Note: none of this speaks directly to learning neural networks from data. 12 / 22

32 3. Computation and learning with neural networks

33 General structure of neural network Neural network for f : R d R. (Easy to generalize to f : R d R k.) 13 / 22

34 General structure of neural network Neural network for f : R d R. (Easy to generalize to f : R d R k.) Directed acyclic graph G = (V, E); vertices regarded as formal variables. ŷ w u,v u v x 1 x 2 x d 13 / 22

35 General structure of neural network Neural network for f : R d R. (Easy to generalize to f : R d R k.) Directed acyclic graph G = (V, E); vertices regarded as formal variables. d source vertices, one per input variable, called x 1,..., x d. w u,v u ŷ v x 1 x 2 x d 13 / 22

36 General structure of neural network Neural network for f : R d R. (Easy to generalize to f : R d R k.) Directed acyclic graph G = (V, E); vertices regarded as formal variables. d source vertices, one per input variable, called x 1,..., x d. Single sink vertex, called ŷ. w u,v u ŷ v x 1 x 2 x d 13 / 22

37 General structure of neural network Neural network for f : R d R. (Easy to generalize to f : R d R k.) Directed acyclic graph G = (V, E); vertices regarded as formal variables. d source vertices, one per input variable, called x 1,..., x d. Single sink vertex, called ŷ. Each edge (u, v) E has a weight w u,v R. x 1 w u,v u x 2 ŷ v x d 13 / 22

38 General structure of neural network Neural network for f : R d R. (Easy to generalize to f : R d R k.) Directed acyclic graph G = (V, E); vertices regarded as formal variables. d source vertices, one per input variable, called x 1,..., x d. Single sink vertex, called ŷ. w u,v u ŷ v Each edge (u, v) E has a weight w u,v R. x 1 x 2 x d Value of vertex v given values of parents π G(v) := {u V : (u, v) E} is v := g(z v), z v := w u,v u. (g is link function, e.g., logistic function.) u π G (v) 13 / 22

39 Organizing and evaluating a neural network Vertices in V partitioned into layers V 0, V 1, / 22

40 Organizing and evaluating a neural network Vertices in V partitioned into layers V 0, V 1,.... V 0 := {x 1,..., x d }, just the input variables. 14 / 22

41 Organizing and evaluating a neural network Vertices in V partitioned into layers V 0, V 1,.... V 0 := {x 1,..., x d }, just the input variables. Put v in V l if longest path in G from some x i to v has l edges. 14 / 22

42 Organizing and evaluating a neural network Vertices in V partitioned into layers V 0, V 1,.... V 0 := {x 1,..., x d }, just the input variables. Put v in V l if longest path in G from some x i to v has l edges. Final layer only consists of sink vertex, ŷ. 14 / 22

43 Organizing and evaluating a neural network Vertices in V partitioned into layers V 0, V 1,.... V 0 := {x 1,..., x d }, just the input variables. Put v in V l if longest path in G from some x i to v has l edges. Final layer only consists of sink vertex, ŷ. Given input values x = (x 1,..., x d ) R d, how to compute f(x)? 14 / 22

44 Organizing and evaluating a neural network Vertices in V partitioned into layers V 0, V 1,.... V 0 := {x 1,..., x d }, just the input variables. Put v in V l if longest path in G from some x i to v has l edges. Final layer only consists of sink vertex, ŷ. Given input values x = (x 1,..., x d ) R d, how to compute f(x)? 1. Compute values of all vertices in V 1, given values of vertices in V 0 (i.e., input variables). v := g(z v), z v := w u,v u. (All parents of v V 1 are in V 0.) u π G (v) 14 / 22

45 Organizing and evaluating a neural network Vertices in V partitioned into layers V 0, V 1,.... V 0 := {x 1,..., x d }, just the input variables. Put v in V l if longest path in G from some x i to v has l edges. Final layer only consists of sink vertex, ŷ. Given input values x = (x 1,..., x d ) R d, how to compute f(x)? 1. Compute values of all vertices in V 1, given values of vertices in V 0 (i.e., input variables). v := g(z v), z v := w u,v u. (All parents of v V 1 are in V 0.) u π G (v) 2. Compute values of all vertices in V 2, given values of vertices in V 0 V 1. (All parents of v V 2 are in V 0 V 1.) 14 / 22

46 Organizing and evaluating a neural network Vertices in V partitioned into layers V 0, V 1,.... V 0 := {x 1,..., x d }, just the input variables. Put v in V l if longest path in G from some x i to v has l edges. Final layer only consists of sink vertex, ŷ. Given input values x = (x 1,..., x d ) R d, how to compute f(x)? 1. Compute values of all vertices in V 1, given values of vertices in V 0 (i.e., input variables). v := g(z v), z v := w u,v u. (All parents of v V 1 are in V 0.) u π G (v) 2. Compute values of all vertices in V 2, given values of vertices in V 0 V 1. (All parents of v V 2 are in V 0 V 1.) 3. Etc., until ŷ = f(x) is computed. 14 / 22

47 Organizing and evaluating a neural network Vertices in V partitioned into layers V 0, V 1,.... V 0 := {x 1,..., x d }, just the input variables. Put v in V l if longest path in G from some x i to v has l edges. Final layer only consists of sink vertex, ŷ. Given input values x = (x 1,..., x d ) R d, how to compute f(x)? 1. Compute values of all vertices in V 1, given values of vertices in V 0 (i.e., input variables). v := g(z v), z v := w u,v u. (All parents of v V 1 are in V 0.) u π G (v) 2. Compute values of all vertices in V 2, given values of vertices in V 0 V 1. (All parents of v V 2 are in V 0 V 1.) 3. Etc., until ŷ = f(x) is computed. This is called forward propagataion. 14 / 22

48 Training a neural network How to fit a neural network to data? 15 / 22

49 Training a neural network How to fit a neural network to data? Use stochastic gradient method! 15 / 22

50 Training a neural network How to fit a neural network to data? Use stochastic gradient method! Basic computational problem: compute partial derivative of loss on a labeled example with respect to a parameter. 15 / 22

51 Training a neural network How to fit a neural network to data? Use stochastic gradient method! Basic computational problem: compute partial derivative of loss on a labeled example with respect to a parameter. Mid-to-late 20th century researchers discovered how to use chain rule to organize gradient computation: backpropagation algorithm. 15 / 22

52 Training a neural network How to fit a neural network to data? Use stochastic gradient method! Basic computational problem: compute partial derivative of loss on a labeled example with respect to a parameter. Mid-to-late 20th century researchers discovered how to use chain rule to organize gradient computation: backpropagation algorithm. Given: labeled example (x, y) R d R; current weights w u,v R for all (u, v) E; 15 / 22

53 Training a neural network How to fit a neural network to data? Use stochastic gradient method! Basic computational problem: compute partial derivative of loss on a labeled example with respect to a parameter. Mid-to-late 20th century researchers discovered how to use chain rule to organize gradient computation: backpropagation algorithm. Given: labeled example (x, y) R d R; current weights w u,v R for all (u, v) E; values v and z v for all non-source v V on input x. (Can first run forward propagation to get v s and z v s.) 15 / 22

54 Training a neural network How to fit a neural network to data? Use stochastic gradient method! Basic computational problem: compute partial derivative of loss on a labeled example with respect to a parameter. Mid-to-late 20th century researchers discovered how to use chain rule to organize gradient computation: backpropagation algorithm. Given: labeled example (x, y) R d R; current weights w u,v R for all (u, v) E; values v and z v for all non-source v V on input x. (Can first run forward propagation to get v s and z v s.) Let l denote loss of prediction ŷ = f(x) (e.g., l := (ŷ y) 2 ). 15 / 22

55 Training a neural network How to fit a neural network to data? Use stochastic gradient method! Basic computational problem: compute partial derivative of loss on a labeled example with respect to a parameter. Mid-to-late 20th century researchers discovered how to use chain rule to organize gradient computation: backpropagation algorithm. Given: labeled example (x, y) R d R; current weights w u,v R for all (u, v) E; values v and z v for all non-source v V on input x. (Can first run forward propagation to get v s and z v s.) Let l denote loss of prediction ŷ = f(x) (e.g., l := (ŷ y) 2 ). Goal: Compute l w u,v, (u, v) E. 15 / 22

56 Backpropagation: exploiting the chain rule Strategy: use chain rule. l = l w u,v ŷ ŷ v v. w u,v 16 / 22

57 Backpropagation: exploiting the chain rule Strategy: use chain rule. l = l w u,v ŷ ŷ v v. w u,v For squared loss l = (ŷ y) 2, l = 2(ŷ y). ŷ Easy to compute with other losses as well. (ŷ is computed in forward propagation.) 16 / 22

58 Backpropagation: exploiting the chain rule Strategy: use chain rule. l = l w u,v ŷ ŷ v v. w u,v For squared loss l = (ŷ y) 2, l = 2(ŷ y). ŷ Easy to compute with other losses as well. (ŷ is computed in forward propagation.) Since v = g(z v) where z v = w u,v u + (terms not involving w u,v), v = v z v w u,v z v w u,v = g (z v) u. w u,v (z v and u are computed in forward propagation.) v u 16 / 22

59 Backpropagation: the recursive part Key trick: compute ŷ v for all v V l, in decreasing order of layer l. 17 / 22

60 Backpropagation: the recursive part Key trick: compute ŷ v for all v V l, in decreasing order of layer l. Strategy: for v ŷ, use multivariate chain rule. Let k = out-deg(v), (v, v 1),..., (v, v k ) E: ŷ. ŷ v = k i=1 ŷ v i vi v. v 1 v k w v,v1 w v,vk v 17 / 22

61 Backpropagation: the recursive part Key trick: compute ŷ v for all v V l, in decreasing order of layer l. Strategy: for v ŷ, use multivariate chain rule. Let k = out-deg(v), (v, v 1),..., (v, v k ) E: ŷ. ŷ v = k i=1 ŷ v i vi v. v 1 v k w v,v1 w v,vk v Since v i = g(z vi ) where z vi = w v,vi v + (terms not involving v), v i v = g (z vi ) w v,vi. (The z vi s are computed in forward propagation.) 17 / 22

62 Backpropagation: the recursive part Key trick: compute ŷ v for all v V l, in decreasing order of layer l. Strategy: for v ŷ, use multivariate chain rule. Let k = out-deg(v), (v, v 1),..., (v, v k ) E: ŷ. ŷ v = k i=1 ŷ v i vi v. v 1 v k w v,v1 w v,vk v Since v i = g(z vi ) where z vi = w v,vi v + (terms not involving v), v i v = g (z vi ) w v,vi. (The z vi s are computed in forward propagation.) Since v i are in a higher layer than v, ŷ v i has already been computed! 17 / 22

63 Matrix view of forward/backward propagation Recall general neural network function f : R d R can be written as f(x) = g L (W L g 2 (W 2g 1 (W 1x)) ), where W i R d i d i 1 is weight matrix for layer i, and g i : R d i R d i collects the non-linear link functions for layer i. 18 / 22

64 Matrix view of forward/backward propagation Recall general neural network function f : R d R can be written as f(x) = g L (W L g 2 (W 2g 1 (W 1x)) ), where W i R d i d i 1 is weight matrix for layer i, and g i : R d i R d i collects the non-linear link functions for layer i. The first layer thus computes and i-th layer computes (Convention: f 0 (x) := x.) f 1 (x) := g 1 (z 1) where z 1 := W 1x, f i (x) := g i (z i) where z i := W if i 1 (x). 18 / 22

65 Matrix view of forward/backward propagation Recall general neural network function f : R d R can be written as f(x) = g L (W L g 2 (W 2g 1 (W 1x)) ), where W i R d i d i 1 is weight matrix for layer i, and g i : R d i R d i collects the non-linear link functions for layer i. The first layer thus computes and i-th layer computes (Convention: f 0 (x) := x.) f 1 (x) := g 1 (z 1) where z 1 := W 1x, f i (x) := g i (z i) where z i := W if i 1 (x). Previous backprop equations prove correctness of the following matrix derivative formula: [ ( l ( ) )] = g W i(z i) W T i+1 g L 1(z L 1) W T L g L(z L)l (f L(x)) f i 1(x) T i where denotes element-wise product. 18 / 22

66 Practical tips Apply stochastic gradient method to examples in random order. Can use several examples to form gradient estimate: mini-batch. λ (t) := 1 b tb i=(t 1)b+1 where l i is loss on i-th training example. l i W W =W (t) 19 / 22

67 Practical tips Apply stochastic gradient method to examples in random order. Can use several examples to form gradient estimate: mini-batch. λ (t) := 1 b tb i=(t 1)b+1 where l i is loss on i-th training example. l i W W =W (t) Standardize inputs (i.e., center and divide by standard deviation). Doing this at every layer during training: batch normalization. (Must also apply same/similar normalization at test time.) 19 / 22

68 Practical tips Apply stochastic gradient method to examples in random order. Can use several examples to form gradient estimate: mini-batch. λ (t) := 1 b tb i=(t 1)b+1 where l i is loss on i-th training example. l i W W =W (t) Standardize inputs (i.e., center and divide by standard deviation). Doing this at every layer during training: batch normalization. (Must also apply same/similar normalization at test time.) Initialization: Take care so initial weights not too large or small. E.g., for node with d inputs, draw weights iid from N(0, 1/d). 19 / 22

69 Modern networks Two kinds of linear layers: dense / fully-connected layers W if i 1 (x) as before, convolutional layers, which have a special sparse representation. 20 / 22

70 Modern networks Two kinds of linear layers: dense / fully-connected layers W if i 1 (x) as before, convolutional layers, which have a special sparse representation. Common link functions: ReLU z i max{0, z i} (applied element-wise) SoftMax z i ez i j ez j (applied element-wise) max-pooling z max i z i (after convolution layers; R d R), 20 / 22

71 Modern networks Two kinds of linear layers: dense / fully-connected layers W if i 1 (x) as before, convolutional layers, which have a special sparse representation. Common link functions: ReLU z i max{0, z i} (applied element-wise) SoftMax z i ez i j ez j (applied element-wise) max-pooling z max i z i (after convolution layers; R d R), Trend in architectures (in some domains, like vision and speech): Few convolutional layers then many dense layers (AlexNet) More convolutional layers (VGGNet) Nearly purely convolutional layers in many (100+) layers with varienty of identity connections throughout (ResNet, DenseNet). 20 / 22

72 Modern networks Two kinds of linear layers: dense / fully-connected layers W if i 1 (x) as before, convolutional layers, which have a special sparse representation. Common link functions: ReLU z i max{0, z i} (applied element-wise) SoftMax z i ez i j ez j (applied element-wise) max-pooling z max i z i (after convolution layers; R d R), Trend in architectures (in some domains, like vision and speech): Few convolutional layers then many dense layers (AlexNet) More convolutional layers (VGGNet) Nearly purely convolutional layers in many (100+) layers with varienty of identity connections throughout (ResNet, DenseNet). Can use intermediate computation (e.g., f i (x)) as feature expansion! Indespensible in visual and audio tasks; application attempts are constant in all other disciplines. 20 / 22

73 Wrap-up Frontier of experimental machine learning research. 21 / 22

74 Wrap-up Frontier of experimental machine learning research. Basic ideas same as from the 1980s. 21 / 22

75 Wrap-up Frontier of experimental machine learning research. Basic ideas same as from the 1980s. What s new? Fast hardware. 21 / 22

76 Wrap-up Frontier of experimental machine learning research. Basic ideas same as from the 1980s. What s new? Fast hardware. Tons of data. 21 / 22

77 Wrap-up Frontier of experimental machine learning research. Basic ideas same as from the 1980s. What s new? Fast hardware. Tons of data. Many-layered networks (made possible through many adjustments). 21 / 22

78 Wrap-up Frontier of experimental machine learning research. Basic ideas same as from the 1980s. What s new? Fast hardware. Tons of data. Many-layered networks (made possible through many adjustments). Applications: visual detection and recognition, speech recognition, general function fitting (e.g., learning reward functions of different actions of video games), etc. 21 / 22

79 Wrap-up Frontier of experimental machine learning research. Basic ideas same as from the 1980s. What s new? Fast hardware. Tons of data. Many-layered networks (made possible through many adjustments). Applications: visual detection and recognition, speech recognition, general function fitting (e.g., learning reward functions of different actions of video games), etc. In cutting-edge applications, training is still delicate. 21 / 22

80 Key takeaways 1. Structure of neural networks; concept of link functions. 2. Motivation for multilayer neural networks and arguments for depth. 3. Forward and backward propagation algorithms. 22 / 22

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation) Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation