Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

Size: px

Start display at page:

Download "Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch."

Sabina Perry
5 years ago
Views:

1 Neural Networks Oliver Schulte - CMPT 726 Bishop PRML Ch. 5

2 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will focus on multi-layer perceptrons Mathematical properties rather than plausibility Prof. Hadley CMPT418

3 Uses of Neural Networks Pros Good for continuous input variables. General continuous function approximators. Highly non-linear. Learn feature functions. Good to use in continuous domains with little knowledge: When you don t know good features. You don t know the form of a good functional model. Cons Not interpretable, black box. Learning is slow. Good generalization can require many datapoints.

4 Applications There are many, many applications. World-Champion Backgammon Player. No Hands Across America Tour. Digit Recognition with 99.26% accuracy....

5 Outline Feed-forward Networks Network Training Error Backpropagation Applications

6 Outline Feed-forward Networks Network Training Error Backpropagation Applications

7 Feed-forward Networks We have looked at generalized linear models of the form: M y(x, w) = f w j φ j (x) for fixed non-linear basis functions φ( ) We now extend this model by allowing adaptive basis functions, and learning their parameters In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of linear combination of the inputs: M φ j (x) = f... j=1 j=1

8 Feed-forward Networks We have looked at generalized linear models of the form: M y(x, w) = f w j φ j (x) for fixed non-linear basis functions φ( ) We now extend this model by allowing adaptive basis functions, and learning their parameters In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of linear combination of the inputs: M φ j (x) = f... j=1 j=1

9 Feed-forward Networks Starting with input x = (x 1,..., x D ), construct linear combinations: D a j = w (1) ji x i + w (1) j0 i=1 These a j are known as activations Pass through an activation function h( ) to get output z j = h(a j ) Model of an individual neuron from Russell and Norvig, AIMA2e

10 Feed-forward Networks Starting with input x = (x 1,..., x D ), construct linear combinations: D a j = w (1) ji x i + w (1) j0 i=1 These a j are known as activations Pass through an activation function h( ) to get output z j = h(a j ) Model of an individual neuron from Russell and Norvig, AIMA2e

11 Feed-forward Networks Starting with input x = (x 1,..., x D ), construct linear combinations: D a j = w (1) ji x i + w (1) j0 i=1 These a j are known as activations Pass through an activation function h( ) to get output z j = h(a j ) Model of an individual neuron from Russell and Norvig, AIMA2e

12 Activation Functions Can use a variety of activation functions Sigmoidal (S-shaped) Logistic sigmoid 1/(1 + exp( a)) (useful for binary classification) Hyperbolic tangent tanh Radial basis function z j = i (x i w ji ) 2 Softmax Useful for multi-class classification Identity Useful for regression Threshold... Needs to be differentiable for gradient-based learning (later) Can use different activation functions in each unit

13 Feed-forward Networks x D hidden units z M w (1) MD w (2) KM y K inputs outputs y 1 x 1 x 0 z 1 w (2) 10 z 0 Connect together a number of these units into a feed-forward network (DAG) Above shows a network with one layer of hidden units Implements function: ) M y k (x, w) = h j=1 w (2) kj h ( D i=1 w (1) ji x i + w (1) j0 + w (2) k0

14 Hidden Units Compute Basis Functions red dots = network function dashed line = hidden unit activation function. blue dots = data points

15 Outline Feed-forward Networks Network Training Error Backpropagation Applications

16 Network Training Given a specified network structure, how do we set its parameters (weights)? As usual, we define a criterion to measure how well our network performs, optimize against it For regression, training data are (x n, t), t n R Squared error naturally arises: E(w) = N {y(x n, w) t n } 2 For binary classification, this is another discriminative model, ML: p(t w) = N n=1 E(w) = n=1 y tn n {1 y n } 1 tn N {t n ln y n + (1 t n ) ln(1 y n )} n=1

17 Network Training Given a specified network structure, how do we set its parameters (weights)? As usual, we define a criterion to measure how well our network performs, optimize against it For regression, training data are (x n, t), t n R Squared error naturally arises: E(w) = N {y(x n, w) t n } 2 For binary classification, this is another discriminative model, ML: p(t w) = N n=1 E(w) = n=1 y tn n {1 y n } 1 tn N {t n ln y n + (1 t n ) ln(1 y n )} n=1

18 Network Training Given a specified network structure, how do we set its parameters (weights)? As usual, we define a criterion to measure how well our network performs, optimize against it For regression, training data are (x n, t), t n R Squared error naturally arises: E(w) = N {y(x n, w) t n } 2 For binary classification, this is another discriminative model, ML: p(t w) = N n=1 E(w) = n=1 y tn n {1 y n } 1 tn N {t n ln y n + (1 t n ) ln(1 y n )} n=1

19 Parameter Optimization E(w) w A w B wc w 1 w 2 E For either of these problems, the error function E(w) is nasty Nasty = non-convex Non-convex = has local minima

20 Descent Methods The typical strategy for optimization problems of this sort is a descent method: w (τ+1) = w (τ) + w (τ) These come in many flavours Gradient descent E(w (τ) ) Stochastic gradient descent E n (w (τ) ) Newton-Raphson (second order) 2 All of these can be used here, stochastic gradient descent is particularly effective Redundancy in training data, escaping local minima

21 Descent Methods The typical strategy for optimization problems of this sort is a descent method: w (τ+1) = w (τ) + w (τ) These come in many flavours Gradient descent E(w (τ) ) Stochastic gradient descent E n (w (τ) ) Newton-Raphson (second order) 2 All of these can be used here, stochastic gradient descent is particularly effective Redundancy in training data, escaping local minima

22 Descent Methods The typical strategy for optimization problems of this sort is a descent method: w (τ+1) = w (τ) + w (τ) These come in many flavours Gradient descent E(w (τ) ) Stochastic gradient descent E n (w (τ) ) Newton-Raphson (second order) 2 All of these can be used here, stochastic gradient descent is particularly effective Redundancy in training data, escaping local minima

23 Computing Gradients The function y(x n, w) implemented by a network is complicated It isn t obvious how to compute error function derivatives with respect to weights Numerical method for calculating error derivatives, use finite differences: E n E n(w ji + ɛ) E n (w ji ɛ) 2ɛ How much computation would this take with W weights in the network? O(W) per derivative, O(W 2 ) total per gradient descent step

24 Computing Gradients The function y(x n, w) implemented by a network is complicated It isn t obvious how to compute error function derivatives with respect to weights Numerical method for calculating error derivatives, use finite differences: E n E n(w ji + ɛ) E n (w ji ɛ) 2ɛ How much computation would this take with W weights in the network? O(W) per derivative, O(W 2 ) total per gradient descent step

25 Computing Gradients The function y(x n, w) implemented by a network is complicated It isn t obvious how to compute error function derivatives with respect to weights Numerical method for calculating error derivatives, use finite differences: E n E n(w ji + ɛ) E n (w ji ɛ) 2ɛ How much computation would this take with W weights in the network? O(W) per derivative, O(W 2 ) total per gradient descent step

26 Computing Gradients The function y(x n, w) implemented by a network is complicated It isn t obvious how to compute error function derivatives with respect to weights Numerical method for calculating error derivatives, use finite differences: E n E n(w ji + ɛ) E n (w ji ɛ) 2ɛ How much computation would this take with W weights in the network? O(W) per derivative, O(W 2 ) total per gradient descent step

27 Outline Feed-forward Networks Network Training Error Backpropagation Applications

28 Error Backpropagation Backprop is an efficient method for computing error derivatives En O(W) to compute derivatives wrt all weights First, feed training example x n forward through the network, storing all activations a j Calculating derivatives for weights connected to output nodes is easy e.g. For linear output nodes y k = i w kiz i : E n = 1 w ki w ki 2 (y n t n ) 2 = (y n t n )x ni For hidden layers, propagate error backwards from the output nodes

29 Error Backpropagation Backprop is an efficient method for computing error derivatives En O(W) to compute derivatives wrt all weights First, feed training example x n forward through the network, storing all activations a j Calculating derivatives for weights connected to output nodes is easy e.g. For linear output nodes y k = i w kiz i : E n = 1 w ki w ki 2 (y n t n ) 2 = (y n t n )x ni For hidden layers, propagate error backwards from the output nodes

30 Error Backpropagation Backprop is an efficient method for computing error derivatives En O(W) to compute derivatives wrt all weights First, feed training example x n forward through the network, storing all activations a j Calculating derivatives for weights connected to output nodes is easy e.g. For linear output nodes y k = i w kiz i : E n = 1 w ki w ki 2 (y n t n ) 2 = (y n t n )x ni For hidden layers, propagate error backwards from the output nodes

31 Chain Rule for Partial Derivatives A reminder For f (x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v: f u = and f x x u + f y y u f v = f x x v + f y y v

32 We can write Error Backpropagation E n = E n (a j1, a j2,..., a jm ) where {j i } are the indices of the nodes in the same layer as node j Using the chain rule: E n = E n a j a j + k E n a k a k where k runs over all other nodes k in the same layer as node j. Since a k does not depend on w ji, all terms in the summation go to 0 E n = E n a j a j

33 We can write Error Backpropagation E n = E n (a j1, a j2,..., a jm ) where {j i } are the indices of the nodes in the same layer as node j Using the chain rule: E n = E n a j a j + k E n a k a k where k runs over all other nodes k in the same layer as node j. Since a k does not depend on w ji, all terms in the summation go to 0 E n = E n a j a j

34 We can write Error Backpropagation E n = E n (a j1, a j2,..., a jm ) where {j i } are the indices of the nodes in the same layer as node j Using the chain rule: E n = E n a j a j + k E n a k a k where k runs over all other nodes k in the same layer as node j. Since a k does not depend on w ji, all terms in the summation go to 0 E n = E n a j a j

35 Error Backpropagation cont. Introduce error δ j En a j E n = δ j a j Other factor is: a j = w jk z k = z i k

36 Error Backpropagation cont. Error δ j can also be computed using chain rule: δ j E n a j = k E n a k a }{{} k a j δ k where k runs over all nodes k in the layer after node j. Eventually: δ j = h (a j ) w kj δ k k A weighted sum of the later error caused by this weight

37 Error Backpropagation cont. Error δ j can also be computed using chain rule: δ j E n a j = k E n a k a }{{} k a j δ k where k runs over all nodes k in the layer after node j. Eventually: δ j = h (a j ) w kj δ k k A weighted sum of the later error caused by this weight

38 Outline Feed-forward Networks Network Training Error Backpropagation Applications

39 Applications of Neural Networks Many success stories for neural networks Credit card fraud detection Hand-written digit recognition Face detection Autonomous driving (CMU ALVINN)

40 Hand-written Digit Recognition MNIST - standard dataset for hand-written digit recognition training, test images

41 LeNet-5 INPUT 32x32 C1: feature maps C3: f. maps S4: f. maps S2: f. maps C5: layer 120 F6: layer 84 OUTPUT 10 Convolutions Subsampling Convolutions Full connection Gaussian connections Subsampling Full connection LeNet developed by Yann LeCun et al. Convolutional neural network Local receptive fields (5x5 connectivity) Subsampling (2x2) Shared weights (reuse same 5x5 filter ) Breaking symmetry See

42 4!>6 3!>5 8!>2 2!>1 5!>3 4!>8 2!>8 3!>5 6!>5 7!>3 9!>4 8!>0 7!>8 5!>3 8!>7 0!>6 3!>7 2!>7 8!>3 9!>4 8!>2 5!>3 4!>8 3!>9 6!>0 9!>8 4!>9 6!>1 9!>4 9!>1 9!>4 2!>0 6!>1 3!>5 3!>2 9!>5 6!>0 6!>0 6!>0 6!>8 4!>6 7!>3 9!>4 4!>6 2!>7 9!>7 4!>3 9!>4 9!>4 9!>4 8!>7 4!>2 8!>4 3!>5 8!>4 6!>5 8!>5 3!>8 3!>8 9!>8 1!>5 9!>8 6!>3 0!>2 6!>5 9!>5 0!>7 1!>6 4!>9 2!>1 2!>8 8!>5 4!>9 7!>2 7!>2 6!>5 9!>7 6!>1 5!>6 5!>0 4!>9 2!>8 The 82 errors made by LeNet5 (0.82% test error rate)

43 Conclusion Readings: Ch. 5.1, 5.2, 5.3 Feed-forward networks can be used for regression or classification Similar to linear models, except with adaptive non-linear basis functions These allow us to do more than e.g. linear decision boundaries Different error functions Learning is more difficult, error function not convex Use stochastic gradient descent, obtain (good?) local minimum Backpropagation for efficient gradient computation

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of