Artificial Neural Networks

Ilene Walsh
6 years ago
1 Artificial Neural Networks Oliver Schulte - CMPT 310

2 Neural Networks. Neural networks arise from attempts to model human/animal brains. There are many models and many claims of biological plausibility. We will focus on statistical and computational properties rather than plausibility. An artificial neural network is a general function approximator. The inner, or hidden, layers compute learned auxiliary functions.

3 Uses of Neural Networks. Pros: Good for continuous input variables. General continuous function approximators. Highly non-linear. Trainable basis functions. Good to use in continuous domains with little prior knowledge: when you don't know good features, or you don't know the form of a good functional model. Cons: Not interpretable (black box). Learning is slow. Good generalization can require many data points.

4 Function Approximation Demos. Home Value of Hockey State: githubusercontent.com/ / eb64b49a-67bf-11e7-97aa f721e5. jpg Function Learning Examples (open in Safari): bpfunctionapprox/bpfunctionapprox.html

5 Applications. There are many, many applications. World-Champion Backgammon Player. No Hands Across America Tour: www/nhaa/nhaa_home_page.html Digit Recognition with 99.26% accuracy. Speech Recognition: features/speechrecognition aspx

6 Outline Feed-forward Networks Network Training Error Backpropagation Applications

8 No Hands Across America. [Figure: driving network with a 30x32 sensor input retina, 4 hidden units, and 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right.]

9 Non-linear Activation Functions. Pass the input $in_j$ through a non-linear activation function $g(\cdot)$ to get the output $a_j = g(in_j)$. Model of an individual neuron from Russell and Norvig, AIMA3e.

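11 Non-linear Activation Functions. Pass the input $in_j$ through a non-linear activation function $g(\cdot)$ to get the output $a_j = g(in_j)$. [Figure: model of an individual neuron. Input links carry activations $a_i$ with weights $w_{i,j}$, plus a bias weight $w_{0,j}$ on the fixed input $a_0 = 1$. The input function computes $in_j = \sum_i w_{i,j} a_i$, the activation function $g$ produces the output $a_j = g(in_j)$, and the output is sent along the output links.] From Russell and Norvig, AIMA3e.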
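
The single-neuron model above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names and the example weights are hypothetical, and the logistic sigmoid is assumed as the activation $g$.

```python
import math

def logistic(x):
    """Logistic sigmoid, one common choice for the activation g."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(weights, inputs, g=logistic):
    """Compute a_j = g(in_j) with in_j = sum_i w_ij * a_i.

    weights[0] is the bias weight w_0j, paired with the fixed input a_0 = 1.
    """
    activations = [1.0] + list(inputs)  # prepend the bias input a_0 = 1
    in_j = sum(w * a for w, a in zip(weights, activations))
    return g(in_j)

# Hypothetical weights: bias weight -1.0, then one weight per input link.
a_j = neuron_output([-1.0, 2.0, 0.5], [1.0, -2.0])
```

Here the weighted input is $in_j = -1 + 2 \cdot 1 + 0.5 \cdot (-2) = 0$, so the logistic activation returns 0.5.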

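12 Network of Neurons. [Figure: (a) a network with two input units (1, 2) connected directly to two output units (3, 4) by weights $w_{1,3}$, $w_{1,4}$, $w_{2,3}$, $w_{2,4}$; (b) a network with two input units (1, 2), one hidden layer of two units (3, 4), and two output units (5, 6), adding weights $w_{3,5}$, $w_{3,6}$, $w_{4,5}$, $w_{4,6}$.]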

13 Activation Functions. Can use a variety of activation functions. Sigmoidal (S-shaped): logistic sigmoid $1/(1 + \exp(-a))$ (useful for binary classification); hyperbolic tangent $\tanh$. Softmax: useful for multi-class classification. Rectified Linear Unit (ReLU): $\max(0, x)$. ... Should be differentiable for gradient-based learning (later). Can use different activation functions in each unit.
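
As a rough sketch, the activation functions listed above can be written with the Python standard library alone. The function names are illustrative choices, and the max-subtraction inside `softmax` is a common numerical-stability trick rather than something stated on the slide.

```python
import math

def logistic(a):
    """Logistic sigmoid 1 / (1 + exp(-a)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def logistic_derivative(a):
    """Derivative of the logistic sigmoid, g'(a) = g(a) * (1 - g(a))."""
    s = logistic(a)
    return s * (1.0 - s)

def tanh(a):
    """Hyperbolic tangent; output lies in (-1, 1)."""
    return math.tanh(a)

def relu(a):
    """Rectified linear unit max(0, a); not differentiable at a = 0."""
    return max(0.0, a)

def softmax(values):
    """Map a list of scores to positive numbers that sum to 1."""
    m = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

The derivative helper matters later: gradient-based learning needs $g'$ at every unit, which is why the slide asks for differentiable activations.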

14 Function Composition. Think logic circuits. [Figure: two surface plots of $h_W(x_1, x_2)$ over $x_1$ and $x_2$.] Two opposite-facing sigmoids = ridge. Two ridges = bump.
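
The ridge and bump constructions can be illustrated numerically. The sketch below is one possible construction, not the slide's exact network: the half-width, the steepness, and the threshold 1.5 are hypothetical choices, and `ridge` is written as a function of one coordinate, so that over the $(x_1, x_2)$ plane it forms a ridge.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def ridge(x, half_width=1.0, steepness=10.0):
    """Sum of a rising and a falling sigmoid, shifted down by 1.

    The result is close to 1 for |x| < half_width and close to 0 elsewhere.
    """
    rising = logistic(steepness * (x + half_width))
    falling = logistic(-steepness * (x - half_width))
    return rising + falling - 1.0

def bump(x1, x2, steepness=10.0):
    """Two perpendicular ridges, thresholded by one more sigmoid.

    The sum of the ridges is near 2 only where both are high, so
    thresholding at 1.5 keeps just the square where they cross.
    """
    return logistic(steepness * (ridge(x1) + ridge(x2) - 1.5))
```

With these settings `bump(0, 0)` is close to 1, while `bump(3, 0)` and `bump(3, 3)` are close to 0, which is the localized response the slide calls a bump.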

15 The XOR Problem Revisited. [Figure: the XOR points in the $(x_1, x_2)$ plane, with decision regions $R_1$ and $R_2$ and output values such as $z = -1$.]

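16 The XOR Problem Solved. [Figure: a three-layer network that solves XOR: input units $i$ ($x_1$, $x_2$, plus a bias unit), hidden units $j$ ($y_1$, $y_2$) with input-to-hidden weights $w_{ji}$, and an output unit $k$ ($z_k$) with hidden-to-output weights $w_{kj}$.]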
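
A small network of this shape can be checked by hand. The sketch below assumes inputs and outputs coded as -1/+1 and a sign activation. The weights are one illustrative choice that happens to work and are not claimed to be the values shown in the figure.

```python
def sign(x):
    """Threshold activation: +1 for non-negative input, -1 otherwise."""
    return 1 if x >= 0 else -1

def xor_network(x1, x2):
    """Two hidden units plus one output unit computing XOR on -1/+1 inputs."""
    y1 = sign(x1 + x2 + 0.5)   # fires unless both inputs are -1 (OR-like)
    y2 = sign(x1 + x2 - 1.5)   # fires only when both inputs are +1 (AND-like)
    # Output: "OR but not AND", which is exactly XOR.
    return sign(0.7 * y1 - 0.4 * y2 - 1.0)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, xor_network(x1, x2))
```

Each hidden unit draws one line in the input plane. The output unit combines the two half-planes, which a single linear unit cannot do.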

17 Hidden Units Compute Auxiliary Functions. [Figure legend: red dots = network function; dashed lines = hidden unit activation functions; blue dots = data points.] The network function is roughly the sum of the hidden unit activation functions.

18 Hidden Units As Feature Extractors. [Figure panels: sample training patterns; learned input-to-hidden weights.] FIGURE: The top images represent patterns from a large training set used to train a sigmoidal network with 64 input nodes and 2 hidden units for classifying three characters. The bottom figures show the input-to-hidden weights, represented as patterns, at the two hidden units after training. Note that these learned weights indeed describe feature groupings useful for the classification task. In large networks, such patterns of learned weights may be difficult to

19 Outline Feed-forward Networks Network Training Error Backpropagation Applications

20 Network Training. Given a specified network structure, how do we set its parameters (weights)? As usual, we define a criterion to measure how well our network performs and optimize against it. Training data are $(x_n, y_n)$; a vector-valued $y_n$ corresponds to a neural net with multiple output nodes. Given a set of weight values $w$, the network defines a function $h_w(x)$. Can train by minimizing L2 loss: $E(w) = \sum_{n=1}^{N} (h_w(x_n) - y_n)^2 = \sum_{n=1}^{N} \sum_k (y_k - a_k)^2$, where $k$ indexes the output nodes.
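
To make the double sum concrete, here is a minimal sketch of the L2 loss over a small batch. The function name and the numbers are made up for illustration: each row of `predictions` stands for the output activations $a_k$ on one example, and each row of `targets` for the corresponding $y_k$.

```python
def l2_loss(predictions, targets):
    """E(w) = sum over examples n and output nodes k of (y_k - a_k)^2."""
    return sum(
        (y_k - a_k) ** 2
        for a, y in zip(predictions, targets)   # loop over examples n
        for a_k, y_k in zip(a, y)               # loop over output nodes k
    )

# Two hypothetical examples, each with two output nodes.
predictions = [[0.9, 0.1], [0.2, 0.7]]
targets = [[1.0, 0.0], [0.0, 1.0]]
loss = l2_loss(predictions, targets)  # 0.01 + 0.01 + 0.04 + 0.09
```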

22 Parameter Optimization. [Figure: error surface $E(w)$ over weights $w_1$ and $w_2$, with marked points $w_A$, $w_B$, and $w_C$.] For either of these problems, the error function $E(w)$ is nasty. Nasty = non-convex. Non-convex = has local minima.

23 Gradient Descent. The function $h_w(x)$ implemented by a network is complicated. There is no closed-form solution, so use gradient descent. It isn't obvious how to compute error function derivatives with respect to hidden weights: this is the credit assignment problem. Backpropagation solves the credit assignment problem.
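
The effect of local minima on gradient descent can be seen even in one dimension. The sketch below uses a hypothetical non-convex error function chosen only for illustration (it is not a neural network error), with a fixed step size `alpha` that is also an assumption.

```python
def error(w):
    """A toy non-convex error function with two local minima."""
    return w**4 - 3 * w**2 + w

def error_derivative(w):
    """Derivative dE/dw of the toy error function."""
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w, alpha=0.01, steps=1000):
    """Repeatedly step against the gradient: w := w - alpha * dE/dw."""
    for _ in range(steps):
        w -= alpha * error_derivative(w)
    return w

w_left = gradient_descent(-2.0)   # settles near w = -1.30
w_right = gradient_descent(2.0)   # settles near w = 1.13
```

The two runs end in different minima with different error values, so the result of gradient descent depends on where the weights start.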

24 Outline Feed-forward Networks Network Training Error Backpropagation Applications

25 Error Backpropagation. Backprop is an efficient method for computing the error derivatives $\partial E / \partial w_{ij}$ for all weights in the network. Intuition: 1. Calculating derivatives for weights connected to output nodes is easy. 2. Treat the derivatives as virtual error, and compute the derivative of the error for nodes in the previous layer. 3. Repeat until you reach the input nodes. This procedure propagates the output error signal backwards through the network. Stochastic Gradient Descent: fix input $x = x_n$ and target output $y = y_n$, resulting in error $E_n$.

26 Error at the output nodes. First, feed training example $x_n$ forward through the network, storing all node activations $a_i$. Calculating derivatives for weights connected to output nodes is easy: it is like logistic regression with input features $a_i$. For output node $j$ with activation $a_j = g(in_j) = g(\sum_i w_{ij} a_i)$: $\partial E_n / \partial w_{ij} = \frac{\partial}{\partial w_{ij}} \frac{1}{2} (y_j - a_j)^2 = -a_i \, g'(in_j) (y_j - a_j)$. This is 0 if there is no error, or if the input $a_i$ from node $i$ is 0. Modified Error: $\Delta[j] \equiv g'(in_j)(y_j - a_j)$. Gradient Descent Weight Update: $w_{ij} \leftarrow w_{ij} + \alpha \, a_i \, \Delta[j]$.
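
The output-node derivative can be sanity-checked against a finite-difference estimate. The sketch below assumes a logistic activation and a single output node. The weights, inputs, target, and step size are hypothetical values chosen for illustration.

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)

def error(w, a, y):
    """E_n = 1/2 * (y_j - a_j)^2 for one output node j."""
    in_j = sum(wi * ai for wi, ai in zip(w, a))
    return 0.5 * (y - g(in_j)) ** 2

def analytic_gradient(w, a, y):
    """Return (dE_n/dw_ij for each i, modified error Delta[j])."""
    in_j = sum(wi * ai for wi, ai in zip(w, a))
    delta_j = g_prime(in_j) * (y - g(in_j))
    return [-ai * delta_j for ai in a], delta_j

def numeric_gradient(w, a, y, eps=1e-6):
    """Central finite differences, used only to check the formula."""
    grads = []
    for i in range(len(w)):
        plus, minus = list(w), list(w)
        plus[i] += eps
        minus[i] -= eps
        grads.append((error(plus, a, y) - error(minus, a, y)) / (2 * eps))
    return grads

w = [0.1, -0.2, 0.3]   # hypothetical weights w_ij into node j
a = [1.0, 0.0, -1.5]   # incoming activations a_i (the second input is 0)
y = 1.0                # target for node j
grad, delta_j = analytic_gradient(w, a, y)
alpha = 0.5
w_new = [wi + alpha * ai * delta_j for wi, ai in zip(w, a)]  # update rule
```

The gradient component for the input with $a_i = 0$ is zero, as the slide notes, and the updated weights give a lower error on this example.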

28 Error at the hidden nodes. Consider a hidden node $i$ connected to downstream nodes in the next layer. The modified error signal $\Delta[i]$ is the node's activation derivative times the weighted sum of the error signals of the connected downstream nodes. In symbols, $\Delta[i] = g'(in_i) \sum_j w_{ij} \Delta[j]$.
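
As a small worked instance of the hidden-node formula, the sketch below evaluates $\Delta[i]$ for one hidden node with two downstream nodes. A logistic activation is assumed, and all numbers are hypothetical.

```python
import math

def g_prime(x):
    """Derivative of the logistic sigmoid."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def hidden_delta(in_i, outgoing_weights, downstream_deltas):
    """Delta[i] = g'(in_i) * sum_j w_ij * Delta[j]."""
    weighted = sum(w * d for w, d in zip(outgoing_weights, downstream_deltas))
    return g_prime(in_i) * weighted

# in_i = 0 gives g'(0) = 0.25; the weighted sum is 2.0*0.1 + (-1.0)*0.3 = -0.1.
delta_i = hidden_delta(0.0, [2.0, -1.0], [0.1, 0.3])  # 0.25 * (-0.1)
```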

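29 Backpropagation Picture (backprop with new notation). [Figure: output units $\omega_1, \omega_2, \omega_3, \dots, \omega_k, \dots, \omega_c$ with error signals $\delta_1, \delta_2, \delta_3, \dots, \delta_k, \dots, \delta_c$, connected by weights $w_{kj}$ to a hidden unit with error signal $\delta_j$, which receives the input through weights $w_{ij}$.] The error signal at a hidden unit is proportional to the error signals at the units it influences: $\Delta[j] = g'(in_j) \sum_k w_{jk} \Delta[k]$.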

30 The Backpropagation Algorithm. 1. Apply input vector $x_n$ and forward propagate to find all inputs $in_i$ and activation output levels $a_i$. 2. Evaluate the error signals $\Delta[j]$ for all output nodes. 3. Backpropagate the $\Delta[j]$ to obtain error signals $\Delta[i]$ for each hidden node $i$. 4. Update each weight $w_{ij}$ using $w_{ij} := w_{ij} + \alpha \, a_i \, \Delta[j]$. Demo: AIspace.
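
The four steps can be put together for a network with one hidden layer. This is a minimal sketch under stated assumptions, not the course's reference implementation: it assumes logistic activations everywhere, stores weights as nested lists with the bias weight at index 0, and uses hypothetical function names.

```python
import math
import random

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)

def forward(w_hidden, w_out, x):
    """Step 1: forward propagate, keeping every in_i and a_i."""
    a_in = [1.0] + list(x)                        # a_0 = 1 is the bias input
    in_hidden = [sum(w * a for w, a in zip(row, a_in)) for row in w_hidden]
    a_hidden = [1.0] + [g(v) for v in in_hidden]  # bias unit for the output layer
    in_out = [sum(w * a for w, a in zip(row, a_hidden)) for row in w_out]
    a_out = [g(v) for v in in_out]
    return a_in, in_hidden, a_hidden, in_out, a_out

def backprop_step(w_hidden, w_out, x, y, alpha):
    """One stochastic gradient step on example (x, y); updates weights in place."""
    a_in, in_hidden, a_hidden, in_out, a_out = forward(w_hidden, w_out, x)
    # Step 2: error signals at the output nodes.
    delta_out = [g_prime(in_out[k]) * (y[k] - a_out[k]) for k in range(len(w_out))]
    # Step 3: backpropagate to the hidden nodes (index j + 1 skips the bias weight).
    delta_hidden = [
        g_prime(in_hidden[j]) * sum(w_out[k][j + 1] * delta_out[k] for k in range(len(w_out)))
        for j in range(len(w_hidden))
    ]
    # Step 4: w_ij := w_ij + alpha * a_i * Delta[j], for both layers.
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += alpha * a_hidden[j] * delta_out[k]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += alpha * a_in[i] * delta_hidden[j]

def example_error(w_hidden, w_out, x, y):
    """E_n = 1/2 * sum_k (y_k - a_k)^2 on one example."""
    a_out = forward(w_hidden, w_out, x)[-1]
    return 0.5 * sum((yk - ak) ** 2 for yk, ak in zip(y, a_out))

random.seed(0)
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 hidden units
w_out = [[random.uniform(-1, 1) for _ in range(3)]]                       # 1 output unit
x, y = [1.0, 0.0], [1.0]
before = example_error(w_hidden, w_out, x, y)
backprop_step(w_hidden, w_out, x, y, alpha=0.1)
after = example_error(w_hidden, w_out, x, y)
```

Note that the hidden error signals are computed before any weight is changed, so step 3 uses the same weights as the forward pass. Looping `backprop_step` over the training examples gives stochastic gradient descent.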

31 Other Learning Topics Regularization: L2-regularizer (weight decay). Prune Weights: the Optimal Brain Method. Experimenting with Network Architectures is often key.
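
For the L2-regularizer, one common formulation (assumed here, since the slide does not spell it out) adds a penalty $\frac{\lambda}{2} \sum_i w_i^2$ to the error, so each gradient step also shrinks every weight toward zero. The function name and the numbers below are hypothetical.

```python
def weight_decay_update(w, grad, alpha, lam):
    """One gradient step on E(w) + (lam / 2) * sum_i w_i^2.

    The penalty contributes lam * w_i to each gradient component,
    which pulls every weight toward zero ("weight decay").
    """
    return [wi - alpha * (gi + lam * wi) for wi, gi in zip(w, grad)]

# With a zero error gradient, only the decay term acts.
w_new = weight_decay_update([1.0, -2.0], [0.0, 0.0], alpha=0.1, lam=0.5)
```

With these values each weight shrinks by a factor of $1 - \alpha \lambda = 0.95$.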

32 Outline Feed-forward Networks Network Training Error Backpropagation Applications

33 Applications of Neural Networks Many success stories for neural networks Credit card fraud detection Hand-written digit recognition Face detection Autonomous driving (CMU ALVINN)

34 Hand-written Digit Recognition. MNIST: the standard dataset for hand-written digit recognition, with training and test images.

35 LeNet-5. [Figure: LeNet-5 architecture. INPUT 32x32, then C1 feature maps (convolutions), S2 feature maps (subsampling), C3 feature maps (convolutions), S4 feature maps (subsampling), C5 layer 120 (full connection), F6 layer 84 (full connection), OUTPUT 10 (Gaussian connections).] LeNet developed by Yann LeCun et al. Convolutional neural network. Local receptive fields (5x5 connectivity). Subsampling (2x2). Shared weights (reuse the same 5x5 filter). Breaking symmetry.
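
The spatial sizes of the feature maps follow from the 32x32 input, the 5x5 receptive fields, and the 2x2 subsampling. The short sketch below works through the arithmetic, assuming stride-1 convolutions without padding and non-overlapping subsampling. The helper names are hypothetical, and the number of feature maps per layer is not modeled.

```python
def conv_output_size(size, kernel=5):
    """Side length after a stride-1 convolution with no padding."""
    return size - kernel + 1

def subsample_output_size(size, factor=2):
    """Side length after non-overlapping factor x factor subsampling."""
    return size // factor

size = 32  # INPUT is 32x32
sizes = {}
for name, op in [
    ("C1", conv_output_size),
    ("S2", subsample_output_size),
    ("C3", conv_output_size),
    ("S4", subsample_output_size),
    ("C5", conv_output_size),
]:
    size = op(size)
    sizes[name] = size

print(sizes)  # {'C1': 28, 'S2': 14, 'C3': 10, 'S4': 5, 'C5': 1}
```

At C5 the 5x5 filter covers the whole 5x5 map from S4, which is consistent with the figure labelling that step as a full connection.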

36 [Figure: the test digits misclassified by LeNet-5, each labelled as true digit -> predicted digit, for example 4 -> 6, 3 -> 5, 8 -> 2.] The 82 errors made by LeNet5 (0.82% test error rate).

37 Conclusion. Feed-forward networks can be used for regression or classification. Learning is more difficult because the error function is not convex. Use stochastic gradient descent to obtain a (good?) local minimum. Backpropagation gives efficient gradient computation.
