Ch.8 Neural Networks

1 Ch.8 Neural Networks
Hantao Zhang
hzhang/c145
The University of Iowa, Department of Computer Science
Artificial Intelligence p.1/??

2 Brains as Computational Devices
Motivation: Algorithms developed over centuries do not fit the complexity of real-world problems. The human brain is the most sophisticated computer suitable for solving extremely complex problems.
Reasonable size: neurons (neural cells), of which only a small portion are used.
Simple building blocks: no cell contains too much information.
Massively parallel: each region of the brain controls specialized tasks.
Fault-tolerant: information is saved mainly in the connections among neurons.
Reliable
Graceful degradation

3 Comparing Brains with Computers
                      Computer                        Human Brain
Computational units   1 CPU, $10^5$ gates             neurons
Storage units         $10^9$ bits RAM, bits disk      neurons, synapses
Cycle time            $10^{-8}$ sec                   $10^{-3}$ sec
Bandwidth             $10^9$ bits/sec                 bits/sec
Neuron updates/sec
Even if a computer is one million times faster than a brain in raw speed, a brain ends up being one billion times faster than a computer at what it does.
Example: recognizing a face
  Brain: < 1 s (a few hundred cycles)
  Computer: billions of cycles

4 Biological System
[Figure: a neuron, with labels: axonal arborization, synapse, axon from another cell, dendrite, axon, nucleus, synapses, cell body or soma]
A neuron does nothing until the collective influence of all its inputs reaches a threshold level. At that point, the neuron produces a full-strength output in the form of a narrow pulse that proceeds from the cell body, down the axon, and into the axon's branches.
It fires! Since it either fires or does nothing, it is considered an all-or-nothing device.
Synapses: increase or decrease the strength of the connection and cause excitation or inhibition of a subsequent neuron.

5 Analogy from Biology
[Figure: a unit. Input links carry activations $a_j$ with weights $W_{j,i}$; the input function $\Sigma$ computes $in_i$; the activation function $g$ produces the output $a_i = g(in_i)$, which is sent along the output links.]
Artificial neurons are viewed as nodes connected to other nodes via links that correspond to neural connections.
Each link is associated with a weight.
The weight determines the nature (+/-) and strength of the node's influence on another.
If the influence of all the links is strong enough, the node is activated (similar to the firing of a neuron).

6 A Neural Network Unit

7 Artificial Neural Network
A neural network is a graph of nodes (or units), connected by links.
Each link has an associated weight, a real number.
Typically, each node i has several incoming links and several outgoing links. Each incoming link provides a real number as input to the node, and the node sends one real number through every outgoing link.
The output of a node is a function of the weighted sum of the node's inputs.

8 The Input Function
Each incoming link of a unit i feeds an input value, or activation value, $a_j$ coming from another unit.
The input function $in_i$ of a unit is simply the weighted sum of the unit's inputs:
$in_i(a_1, \ldots, a_{n_i}) = \sum_{j=1}^{n_i} W_{j,i}\, a_j$
The unit applies the activation function $g_i$ to the result of $in_i$ to produce an output:
$out_i = g_i(in_i) = g_i\left(\sum_{j=1}^{n_i} W_{j,i}\, a_j\right)$
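The input and output functions of a single unit can be sketched in C as follows. This is only an illustration: the names `unit_input`, `unit_output`, `activation_fn` and `identity` are hypothetical and do not come from the slides.

```c
#include <stddef.h>

/* Hypothetical type for an activation function g: maps in_i to an output. */
typedef double (*activation_fn)(double);

/* Input function: in_i = sum over j of w[j] * a[j]. */
double unit_input(const double *w, const double *a, size_t n)
{
    double in = 0.0;
    for (size_t j = 0; j < n; j++)
        in += w[j] * a[j];
    return in;
}

/* Output function: out_i = g(in_i). */
double unit_output(const double *w, const double *a, size_t n, activation_fn g)
{
    return g(unit_input(w, a, n));
}

/* Example activation for illustration only: the identity, so out_i = in_i. */
double identity(double x)
{
    return x;
}
```

Passing a different `activation_fn` (a step or sigmoid function, for instance) changes the unit's behavior without touching the weighted sum.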

9 Typical Activation Functions
[Figure: plots of $a_i$ against $in_i$ for (a) the step function with threshold $t$, (b) the sign function, (c) the sigmoid function]
(a) Step function: $step_t(x) = 1$ if $x \ge t$; $0$ if $x < t$
(b) Sign function: $sign(x) = +1$ if $x \ge 0$; $-1$ if $x < 0$
(c) Sigmoid function: $sig(x) = \frac{1}{1 + e^{-x}}$

10 Typical Activation Functions 2
Hard limiter (binary step): $f_\theta(x) = 1$ if $x > \theta$; $0$ if $-\theta \le x \le \theta$; $-1$ if $x < -\theta$
Binary sigmoid (exponential sigmoid): $sig(x) = \frac{1}{1 + e^{-\sigma x}}$, where $\sigma$ controls the saturation of the curve. When $\sigma \to \infty$, the hard limiter is achieved.
Bipolar sigmoid (atan): $f(x) = \tan^{-1}(\sigma x)$

11 Units as Logic Gates
AND: W = 1, W = 1, t = 1.5
OR: W = 1, W = 1, t = 0.5
NOT: W = -1, t = -0.5
Activation function: $step_t$
Since units can implement the $\land$, $\lor$, $\lnot$ Boolean operators, neural nets built from such units can implement any Boolean function.
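A minimal C sketch of the three gate units, assuming the step function fires when the weighted sum is at least the threshold, and assuming the NOT unit uses weight -1 and threshold -0.5 (the minus signs are an assumption wherever they are not legible in the slide). The function names are hypothetical.

```c
/* Threshold unit: returns 1 when the weighted sum x reaches threshold t. */
int step_t(double x, double t)
{
    return x >= t ? 1 : 0;
}

/* AND: both weights 1, threshold 1.5, so both inputs must be 1. */
int unit_and(int a, int b)
{
    return step_t(1.0 * a + 1.0 * b, 1.5);
}

/* OR: both weights 1, threshold 0.5, so one active input suffices. */
int unit_or(int a, int b)
{
    return step_t(1.0 * a + 1.0 * b, 0.5);
}

/* NOT: weight -1, threshold -0.5, so the unit fires only when a is 0. */
int unit_not(int a)
{
    return step_t(-1.0 * a, -0.5);
}
```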

12 Structures of Neural Networks
Directed:
  Acyclic:
    Feed-forward:
      Multi-layer: nodes are grouped into layers and all links are from one layer to the next layer.
      Single-layer: each node sends its output out of the network.
    Tree: ...
    Arbitrary feed: ...
  Cyclic: ...
Undirected: ...

13 Multilayer, Feed-forward Networks
A kind of neural network in which
  links are directional and form no cycles (the net is a directed acyclic graph);
  the root nodes of the graph are input units, whose activation value is determined by the environment;
  the leaf nodes are output units;
  the remaining nodes are hidden units;
  units can be divided into layers: a unit in a layer is connected only to units in the next layer.

14 A Two-layer, Feed-forward Network
[Figure: output units $O_i$ at the top, connected by weights $W_{j,i}$ to hidden units $a_j$, which are connected by weights $W_{k,j}$ to input units $I_k$ at the bottom]
Notes:
  The roots of the graph are at the bottom and the (only) leaf is at the top.
  The layer of input units is generally not counted (which is why this is a two-layer net).

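15 Example
[Figure: inputs $I_1$, $I_2$ feed hidden units $H_3$, $H_4$ through weights $w_{13}$, $w_{14}$, $w_{23}$, $w_{24}$; the hidden units feed the output $O_5$ through weights $w_{35}$, $w_{45}$]
$a_5 = g_5(W_{3,5}\, a_3 + W_{4,5}\, a_4) = g_5\big(W_{3,5}\, g_3(W_{1,3}\, a_1 + W_{2,3}\, a_2) + W_{4,5}\, g_4(W_{1,4}\, a_1 + W_{2,4}\, a_2)\big)$
where $a_i$ is the output and $g_i$ is the activation function of node i.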

16 Multilayer, Feed-forward Networks
A powerful computational device:
  with just one hidden layer, they can approximate any continuous function;
  with just two hidden layers, they can approximate any computable function.
However, the number of needed units per layer may grow exponentially with the number of input units.

17 Perceptrons
Single-layer, feed-forward networks whose units use a step function as activation function.
[Figure: left, a perceptron network with input units $I_j$, weights $W_{j,i}$ and output units $O_i$; right, a single perceptron with input units $I_j$, weights $W_j$ and one output unit $O$]

18 Perceptrons
Perceptrons caused a great stir when they were invented because it was shown that
  if a function is representable by a perceptron, then it is learnable with 100% accuracy, given enough training examples.
The problem is that perceptrons can only represent linearly separable functions.

19 Linearly Separable Functions
On a 2-dimensional space:
[Figure: three plots over inputs $I_1$ and $I_2$: (a) $I_1$ and $I_2$, (b) $I_1$ or $I_2$, (c) $I_1$ xor $I_2$, where the separating line is marked "?"]
A black dot corresponds to an output value of 1. An empty dot corresponds to an output value of 0.

20 How to Represent XOR function by NN
[Figure: inputs $I_1$, $I_2$ feed hidden units $H_3$, $H_4$ through weights $w_{13}$, $w_{14}$, $w_{23}$, $w_{24}$; the hidden units feed the output $O_5$ through weights $w_{35}$, $w_{45}$]
$a_5 = g_5(W_{3,5}\, a_3 + W_{4,5}\, a_4) = g_5\big(W_{3,5}\, g_3(W_{1,3}\, a_1 + W_{2,3}\, a_2) + W_{4,5}\, g_4(W_{1,4}\, a_1 + W_{2,4}\, a_2)\big)$
where $a_i$ is the output, $g_i = step_{0.5}$ is the activation function of node i, $W_{13} = W_{24} = W_{35} = W_{45} = 1$, and $W_{14} = W_{23} = -1$.
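The XOR network can be checked with a short C sketch. It assumes the cross weights $W_{14}$ and $W_{23}$ are -1 (with all six weights equal to +1 the network would compute OR rather than XOR) and that $step_{0.5}$ fires when its input is at least 0.5. The function names are hypothetical.

```c
/* step_0.5: output 1 when the weighted input is at least 0.5. */
static int step_half(double x)
{
    return x >= 0.5 ? 1 : 0;
}

/* Two-layer network for XOR with nodes 1,2 (inputs), 3,4 (hidden), 5 (output). */
int xor_net(int i1, int i2)
{
    const double w13 = 1.0, w24 = 1.0, w35 = 1.0, w45 = 1.0;
    const double w14 = -1.0, w23 = -1.0;   /* assumed negative cross weights */

    int a3 = step_half(w13 * i1 + w23 * i2);   /* fires for i1 = 1, i2 = 0 */
    int a4 = step_half(w14 * i1 + w24 * i2);   /* fires for i1 = 0, i2 = 1 */
    return step_half(w35 * a3 + w45 * a4);     /* OR of the hidden units  */
}
```

Each hidden unit detects one of the two "exactly one input is on" cases, and the output unit combines them.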

21 A Linearly Separable Function
On a 3-dimensional space:
The minority function: return 1 if the input vector contains fewer ones than zeros; return 0 otherwise.
[Figure: (a) separating plane; (b) weights and threshold: inputs $I_1$, $I_2$, $I_3$, each with W = -1, and threshold t = -1.5]
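As a sanity check, the single unit can be compared against a direct count of ones. The sketch below assumes weights of -1 on all three inputs and a threshold of -1.5 (the signs are an assumption), with the unit firing when the weighted sum is at least the threshold. Both function names are hypothetical.

```c
/* Single threshold unit with three inputs, each weighted -1, threshold -1.5. */
int minority3_unit(int i1, int i2, int i3)
{
    const double w = -1.0;
    const double t = -1.5;
    double in = w * i1 + w * i2 + w * i3;
    return in >= t ? 1 : 0;
}

/* Reference definition: 1 if the input has fewer ones than zeros. */
int minority3_count(int i1, int i2, int i3)
{
    int ones = i1 + i2 + i3;
    int zeros = 3 - ones;
    return ones < zeros ? 1 : 0;
}
```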

22 Computing with a 2-layer NN

23 Computing with a 2-layer NN
    #define NUM_INPUTS 4
    #define NUM_HIDDEN_NEURONS 4
    #define NUM_OUTPUT_NEURONS 3

    typedef struct mlp_s {
        /* Inputs to the MLP (+1 for bias) */
        double inputs[NUM_INPUTS + 1];
        /* Weights from input layer to hidden layer (+1 for bias) */
        double w_h_i[NUM_HIDDEN_NEURONS][NUM_INPUTS + 1];
        /* Hidden layer (+1 for bias) */
        double hidden[NUM_HIDDEN_NEURONS + 1];
        /* Weights from hidden layer to output layer (+1 for bias) */
        double w_o_h[NUM_OUTPUT_NEURONS][NUM_HIDDEN_NEURONS + 1];
        /* Outputs of the MLP */
        double outputs[NUM_OUTPUT_NEURONS];
    } mlp_t;

    /* Step activation with threshold 0 */
    double step(double x)
    {
        if (x > 0.0) return 1.0;
        else return 0.0;
    }

24 Computing with a 2-layer NN
    /* The caller sets inputs[NUM_INPUTS] and hidden[NUM_HIDDEN_NEURONS]
     * to the fixed bias input before calling feed_forward. */
    void feed_forward(mlp_t *mlp)
    {
        int i, h, out;

        /* Feed the inputs (including the bias) to the hidden layer
         * through the input-to-hidden weights. */
        for (h = 0; h < NUM_HIDDEN_NEURONS; h++) {
            mlp->hidden[h] = 0.0;
            for (i = 0; i < NUM_INPUTS + 1; i++) {
                mlp->hidden[h] += mlp->inputs[i] * mlp->w_h_i[h][i];
            }
            mlp->hidden[h] = step(mlp->hidden[h]);
        }

        /* Feed the hidden layer activations (including the bias) to the
         * output layer through the hidden-to-output weights. */
        for (out = 0; out < NUM_OUTPUT_NEURONS; out++) {
            mlp->outputs[out] = 0.0;
            for (h = 0; h < NUM_HIDDEN_NEURONS + 1; h++) {
                mlp->outputs[out] += mlp->hidden[h] * mlp->w_o_h[out][h];
            }
            mlp->outputs[out] = step(mlp->outputs[out]);
        }
    }

25 Applications of Neural Networks
Signal and Image Processing
  Signal prediction (e.g., weather prediction)
  Adaptive noise cancellation
  Satellite image analysis
  Multimedia processing
Bioinformatics
  Functional classification of proteins and genes
  Clustering of genes based on DNA microarray data

26 Applications of Neural Networks
Astronomy
  Classification of objects (stars and galaxies)
  Compression of astronomical data
Finance and Marketing
  Stock market prediction
  Fraud detection
  Loan approval
  Product bundling
  Strategic planning

27 Computing with NNs
Different functions are implemented by different network topologies and unit weights.
The lure of NNs is that a network need not be explicitly programmed to compute a certain function f.
Given enough nodes and links, a NN can learn the function by itself.
It does so by looking at a training set of input/output pairs for f and modifying its topology and weights so that its own input/output behavior agrees with the training pairs.
In other words, NNs learn by induction, too.

28 Learning = Training in NN
Neural networks are trained using data referred to as a training set.
The process is one of computing outputs, comparing outputs with desired answers, adjusting weights, and repeating.
The information of a neural network is in its structure, activation functions, and weights.
Learning to use different structures and activation functions is very difficult.
The weights express the relative strength of an input value, whether it comes from outside the network or from a connecting unit (i.e., in another layer). It is by adjusting these weights that a neural network learns.

29 Process for Developing NN
1. Collect data: ensure that the application is amenable to a NN approach and pick data randomly.
2. Separate the data into a training set and a test set.
3. Define a network structure: are perceptrons sufficient?
4. Select a learning algorithm: decided by available tools.
5. Set parameter values: they will affect the length of the training period.
6. Training: determine and revise weights.
7. Test: if not acceptable, go back to steps 1, 2, ...; otherwise, deliver the product.

30 The Perceptron Learning Method
Weight updating in perceptrons is very simple because each output node is independent of the other output nodes.
[Figure: left, a perceptron network with input units $I_j$, weights $W_{j,i}$ and output units $O_i$; right, a single perceptron with input units $I_j$, weights $W_j$ and one output unit $O$]
With no loss of generality, then, we can consider a perceptron with a single output node.

31 Normalizing Unit Thresholds
Notice that, if t is the threshold value of the output unit, then
$step_t\left(\sum_{j=1}^{n} W_j I_j\right) = step_0\left(\sum_{j=0}^{n} W_j I_j\right)$
where $W_0 = t$ and $I_0 = -1$.
Therefore, we can always assume that the unit's threshold is 0 if we include the actual threshold as the weight of an extra link with a fixed input value.
This allows thresholds to be learned like any other weight.
Then, we can even allow output values in [0, 1] by replacing $step_0$ by the sigmoid function.
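The equivalence can be illustrated with two small C functions, one using an explicit threshold and one using threshold 0 with an extra link. The sketch assumes the extra link has weight $W_0 = t$ and fixed input $I_0 = -1$, stored at index 0 of the augmented arrays. The function names are hypothetical.

```c
/* step_t applied to the sum over the n real inputs. */
int step_with_threshold(const double *w, const double *in, int n, double t)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += w[j] * in[j];
    return sum >= t ? 1 : 0;
}

/* step_0 applied to the sum over the augmented inputs, where
 * w_aug[0] holds the threshold t and in_aug[0] is fixed at -1. */
int step_zero_augmented(const double *w_aug, const double *in_aug, int n_aug)
{
    double sum = 0.0;
    for (int j = 0; j < n_aug; j++)
        sum += w_aug[j] * in_aug[j];
    return sum >= 0.0 ? 1 : 0;
}
```

For the AND unit with weights 1, 1 and threshold 1.5, the augmented form uses weights 1.5, 1, 1 on inputs -1, $I_1$, $I_2$.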

32 The Perceptron Learning Method
If O is the value returned by the output unit for a given example and T is the expected output, then the unit's error is
$Err = T - O$
If the error Err is positive we need to increase O; otherwise, we need to decrease O.

33 The Perceptron Learning Method
Since $O = g\left(\sum_{j=0}^{n} W_j I_j\right)$, we can change O by changing each $W_j$.
Assuming g is monotonic, to increase O we should
  increase $W_j$ if $I_j$ is positive,
  decrease $W_j$ if $I_j$ is negative.
Similarly, to decrease O we should
  decrease $W_j$ if $I_j$ is positive,
  increase $W_j$ if $I_j$ is negative.
This is done by updating each $W_j$ as follows:
$W_j \leftarrow W_j + \alpha \times I_j \times Err$
where $\alpha$ is a positive constant, the learning rate, and $Err = T - O$.
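A minimal C sketch of this update rule, applied to a single perceptron with a $step_0$ output. It assumes a bias input fixed at -1 in position 0 (so the threshold is learned as $W_0$) and stops when an entire pass over the examples produces no error. The names `N_IN`, `perceptron_out` and `perceptron_train` are hypothetical.

```c
#define N_IN 3   /* bias input I_0 = -1 plus two real inputs */

/* Output of the perceptron: step_0 of the weighted sum. */
int perceptron_out(const double w[N_IN], const double in[N_IN])
{
    double sum = 0.0;
    for (int j = 0; j < N_IN; j++)
        sum += w[j] * in[j];
    return sum >= 0.0 ? 1 : 0;
}

/* Repeats W_j <- W_j + alpha * I_j * Err over all examples.
 * Returns the number of the first epoch with no errors,
 * or -1 if max_epochs is reached first. */
int perceptron_train(double w[N_IN], double x[][N_IN], const int t[],
                     int n_examples, double alpha, int max_epochs)
{
    for (int epoch = 1; epoch <= max_epochs; epoch++) {
        int errors = 0;
        for (int e = 0; e < n_examples; e++) {
            int err = t[e] - perceptron_out(w, x[e]);   /* Err = T - O */
            if (err != 0) {
                errors++;
                for (int j = 0; j < N_IN; j++)
                    w[j] += alpha * x[e][j] * err;
            }
        }
        if (errors == 0)
            return epoch;
    }
    return -1;
}
```

Training on the four examples of the AND function, starting from zero weights, is one linearly separable case on which this loop terminates.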

34 Theoretic Background
Learn by adjusting weights to reduce the error on the training set.
The squared error for an example with input x and true output y is
$E = \frac{1}{2} Err^2 = \frac{1}{2}\left(y - h_W(x)\right)^2$
Perform optimization search by gradient descent:
$\frac{\partial E}{\partial W_j} = Err \times \frac{\partial Err}{\partial W_j} = Err \times \frac{\partial}{\partial W_j}\left(y - g\left(\sum_{j=0}^{n} W_j x_j\right)\right) = -Err \times g'(in) \times x_j$

35 Weight update rule
$W_j \leftarrow W_j - \alpha \times \frac{\partial E}{\partial W_j} = W_j + \alpha \times Err \times g'(in) \times x_j$
E.g., positive error $\Rightarrow$ increase network output $\Rightarrow$ increase weights on positive inputs and decrease weights on negative inputs.
Simple weight update rule (assuming $g'(in)$ constant):
$W_j \leftarrow W_j + \alpha \times Err \times x_j$

36 A 5-place Minority Function
At first, collect the data (see below), then choose a structure (a perceptron with five inputs and one output) and the activation function (i.e., $step_3$). Finally, set up parameters (i.e., $W_i = 0$) and start to learn:
Assuming $\alpha = 1$, we have $Sum = \sum_{i=1}^{5} W_i I_i$, $Out = step_3(Sum)$, $Err = T - Out$, and $W_j \leftarrow W_j + I_j \times Err$.
[Table: one row per training example e, with columns $I_1$, $I_2$, $I_3$, $I_4$, $I_5$, $T$, $W_1$, $W_2$, $W_3$, $W_4$, $W_5$, $Sum$, $Out$, $Err$]

37 A 5-place Minority Function
The same as the last example, except that $\alpha = 0.5$ instead of $\alpha = 1$, and initial $W_i$.
$Sum = \sum_{i=1}^{5} W_i I_i$, $Out = step_3(Sum)$, $Err = T - Out$, and $W_j \leftarrow W_j + \alpha \times I_j \times Err$.
[Table: one row per training example e, with columns $I_1$, $I_2$, $I_3$, $I_4$, $I_5$, $T$, $W_1$, $W_2$, $W_3$, $W_4$, $W_5$, $Sum$, $Out$, $Err$]

38 Multilayer Perceptrons (MLP)
Layers are usually fully connected; the number of hidden units is typically chosen by hand.
[Figure: output units $a_i$ at the top, connected by weights $W_{j,i}$ to hidden units $a_j$, which are connected by weights $W_{k,j}$ to input units $a_k$ at the bottom]
All continuous functions can be represented with 2 layers, all functions with 3 layers.

39 Sigmoid Function in MLP
$g(x) = \frac{1}{1 + e^{-cx}}$
As c grows, the curve becomes sharper:
[Figure: sigmoid curves for increasing values of c]

40 Sigmoid Function in Perceptron
$g(x) = \frac{1}{1 + e^{-cx}}$
Only linearly separable functions:
[Figure: output surface of a sigmoid perceptron]

41 Sigmoid Function in Two Layer MLP
$g(x) = \frac{1}{1 + e^{-cx}}$
Two hidden nodes produce ridge-like functions:
[Figure: surface plot of $h_W(x_1, x_2)$ over $x_1$ and $x_2$]

42 Sigmoid Function in Two Layer MLP
$g(x) = \frac{1}{1 + e^{-cx}}$
Four hidden nodes produce bump-like functions:
[Figure: surface plot of $h_W(x_1, x_2)$ over $x_1$ and $x_2$]

43 Errors with Sigmoid Functions
$\frac{\partial E}{\partial W_j} = -Err \times g'(in) \times x_j$
$W_j \leftarrow W_j - \alpha \times \frac{\partial E}{\partial W_j} = W_j + \alpha \times Err \times g'(in) \times x_j$
Assuming $g(x) = \frac{1}{1 + e^{-x}}$,
$g'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = g(x)\,(1 - g(x))$
Eq (8.8) on page 267 of the textbook, $g'(u) = u(1 - u)$, is a typo.

44 Back-propagation Learning
1. Phase 1: Propagation
  (a) Forward propagation of a training example's input to get the output O.
  (b) Backward propagation of the output error to generate the deltas of all neural nodes (neurons).
2. Phase 2: Weight update
  (a) For each weight, multiply its output delta and its input activation to get the gradient of the weight.
  (b) Move the weight in the opposite direction of the gradient by subtracting a ratio of the gradient from the weight.
(Most neuroscientists deny that back-propagation occurs in the brain.)

45 Back-propagation Learning
Output layer: similar to the single-layer perceptron, let $\Delta_i = Err_i \times g'(in_i)$ (called $E_O$ in (Eq 8.6)).
$W_{ji} \leftarrow W_{ji} + \alpha \times a_j \times \Delta_i$
Hidden layer: back-propagate the error from the output layer,
$\Delta_j = \left(\sum_i W_{ji}\, \Delta_i\right) g'(in_j)$
(called $E_h$ in (Eq 8.7), which has typos).
The update rule is identical: $W_{kj} \leftarrow W_{kj} + \alpha \times a_k \times \Delta_j$.

46 Back-propagation Learning
    Initialize the weights in the network (often randomly)
    while (stopping criterion not met)
        for each example e in the training set
            O = neural-net-output(network, e)    // forward phase
            T = teacher output for e
            Calculate err = (T - O) at the output units
            // backward phase
            Compute delta_wh for all weights from hidden layer to output layer
            Compute delta_wi for all weights from input layer to hidden layer
            Update the weights in the network
    Return the network

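48 Learning XOR Function
Assuming $\alpha = 0.5$, $g' = 1$;
$S_1 = u \times u_1 + v \times v_1$, $a_1 = step_{0.5}(S_1)$,
$S_2 = u \times u_2 + v \times v_2$, $a_2 = step_{0.5}(S_2)$,
$S_3 = a_1 \times w_1 + a_2 \times w_2$, $Out = step_{0.5}(S_3)$, $Err = T - Out$,
$w_j \leftarrow w_j + \alpha \times a_j \times Err$ for $j = 1, 2$,
$u_i \leftarrow u_i + \alpha \times u \times w_i \times Err$ and $v_i \leftarrow v_i + \alpha \times v \times w_i \times Err$, for $i = 1, 2$.
[Table: one row per training example e, with columns $u$, $v$, $T$, $u_1$, $u_2$, $v_1$, $v_2$, $w_1$, $w_2$, $S_1$, $a_1$, $S_2$, $a_2$, $S_3$, $Out$, $Err$]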

49 Handwriting Digits
The neural network has 35 input nodes for such an image.
The NN is trained on these perfect examples many times.

50 Handwriting Digits
The neural network has 10 hidden nodes, fully connected to all the input nodes (350 edges), and fully connected to all 10 output nodes (100 edges).
Each output node represents one digit.
The final result is decided by the output node with the maximum value (winner takes all).

51 Neural Network Summary
Advantages
  Easy to adapt to unknown situations
  Robustness: fault tolerance due to network redundancy
  Autonomous learning and generalization
Disadvantages
  Poor accuracy
  Large complexity of the network structure
  Over-fitting to training examples (cannot generalize well)
  The solution is a black box (no insights into the problem)

52 Nearest Neighbor Classification
Compute the distances from the input to all the examples;
Choose the k examples which are the nearest neighbors of the input;
Decide the majority class of these k neighbors.
Question: Can this classification technique be regarded as a neural network?
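The three steps can be sketched in C as below. The sketch assumes two-dimensional inputs, squared Euclidean distance, integer class labels in the range 0 to `KNN_MAX_CLASSES - 1`, and ties broken in favor of the lower label; none of these choices come from the slides, and all names are hypothetical.

```c
#include <stddef.h>

#define KNN_DIM 2
#define KNN_MAX_CLASSES 8

typedef struct {
    double x[KNN_DIM];
    int label;            /* assumed to lie in 0 .. KNN_MAX_CLASSES-1 */
} example_t;

/* Squared Euclidean distance between two points. */
static double sq_dist(const double *a, const double *b)
{
    double d = 0.0;
    for (size_t i = 0; i < KNN_DIM; i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

/* Returns the majority class among the k nearest examples to query. */
int knn_classify(const example_t *ex, size_t n, const double *query, size_t k)
{
    int votes[KNN_MAX_CLASSES] = {0};
    double last_d = -1.0;   /* distance of the previously selected neighbor */
    size_t last_i = 0;      /* index of the previously selected neighbor */

    if (k > n)
        k = n;
    for (size_t round = 0; round < k; round++) {
        size_t best = n;
        double best_d = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = sq_dist(ex[i].x, query);
            /* Consider only examples ordered after the last pick by (distance, index). */
            if (!(d > last_d || (d == last_d && i > last_i)))
                continue;
            if (best == n || d < best_d) {
                best = i;
                best_d = d;
            }
        }
        if (best == n)
            break;
        votes[ex[best].label]++;
        last_d = best_d;
        last_i = best;
    }

    int winner = 0;
    for (int c = 1; c < KNN_MAX_CLASSES; c++)
        if (votes[c] > votes[winner])
            winner = c;
    return winner;
}
```

Selecting neighbors one at a time costs O(kn) distance evaluations, which keeps the sketch free of dynamic allocation; sorting the distances once would be the usual choice for larger k.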

53 Probabilistic Neural Network (PNN)
Compute the distances from the input to all the examples;
Accumulate the normalized distances for each class of the examples;
Decide the class with the largest accumulated normalized distance (winner takes all).

54 Probabilistic Neural Network (PNN)
Compute the distances from the input to all the examples:
$h_i = E_i \cdot F = \sum_{j=1}^{n} e_{ij}\, a_j$ (Eq 8.13),
where $E_i = (e_{i1}, e_{i2}, \ldots, e_{in})$ is an example and $F = (a_1, a_2, \ldots, a_n)$ is the input.
Some people also use the Euclidean distance: $h_i = \sqrt{\sum_{j=1}^{n} (e_{ij} - a_j)^2}$.
Accumulate the normalized distances for each class of the examples:
Initially $c_j := 0$; then $c_j \mathrel{+}= e^{(h_i - 1)/\gamma^2} / N_j$ for each $h_i$ whose example $E_i$ belongs to class j,
where $c_j$ is the accumulated normalized distance for class j, $N_j$ is the number of examples of class j, and $\gamma$ is the smoothing factor.
This normalized distance is called the normalized Radial-Basis Function (RBF) in probability theory (a kind of probability density). Other normalized distances are also used in practice.
Decide the class with the largest accumulated normalized distance (winner takes all).

55 Summary
Learning is needed for unknown environments and lazy designers.
The learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation.
For supervised learning, the aim is to find a simple hypothesis approximately consistent with the training examples.
Learning performance = prediction accuracy measured on the test set.
Many applications: speech, driving, handwriting, credit cards, etc.
