Artificial Intelligence Learning and Neural Networks Readings: Chapter 19 & 20.5 of Russell & Norvig
Example: A Feed-forward Network w 13 I 1 H 3 w 35 w 14 O 5 I 2 w 23 w 24 H 4 w 45 a 5 = g 5 (W 3,5 a 3 + W 4,5 a 4 ) = g 5 (W 3,5 g 3 (W 1,3 a 1 + W 2,3 a 2 ) + W 4,5 g 4 (W 1,4 a 1 + W 2,4 a 2 )) where a i is the output and g i is the activation function of node i.
Computing with NNs Different functions are implemented by different network topologies and unit weights. The lure of NNs is that a network need not be explicitly programmed to compute a certain function f. Given enough nodes and links, a NN can learn the function by itself. It does so by looking at a training set of input/output pairs for f and modifying its topology and weights so that its own input/output behavior agrees with the training pairs. In other words, NNs learn by induction, too.
Learning = Training in Neural Networks Neural networks are trained using data referred to as a training set. The process is one of computing outputs, compare outputs with desired answers, adjust weights and repeat. The information of a Neural Network is in its structure, activation functions, weights, and Learning to use different structures and activation functions is very difficult. These weights are used to express the relative strength of an input value or from a connecting unit (i.e., in another layer). It is by adjusting these weights that a neural network learns.
Process for Developing Neural Networks 1. Collect data Ensure that application is amenable to a NN approach and pick data randomly. 2. Separate Data into Training Set and Test Set 3. Define a Network Structure Are perceptrons sufficient? 4. Select a Learning Algorithm Decided by available tools 5. Set Parameter Values They will affect the length of the training period. 6. Training Determine and revise weights 7. Test If not acceptable, go back to steps 1, 2,..., or 5. 8. Delivery of the product
The Perceptron Learning Method Weight updating in perceptrons is very simple because each output node is independent of the other output nodes. I j W j,i O i I j W j O Input Units Output Units Input Units Output Unit Perceptron Network Single Perceptron With no loss of generality then, we can consider a perceptron with a single output node.
Normalizing Unit Thresholds. Notice that, if t is the threshold value of the output unit, then step t ( n j=1 W j I j ) = step 0 ( n j=0 W j I j ) where W 0 = t and I 0 = 1. Therefore, we can always assume that the unit s threshold is 0 if we include the actual threshold as the weight of an extra link with a fixed input value. This allows thresholds to be learned like any other weight. Then, we can even allow output values in [0, 1] by replacing step 0 by the sigmoid function.
The Perceptron Learning Method If O is the value returned by the output unit for a given example and T is the expected output, then the unit s error is Err = T O If the error Err is positive we need to increase O; otherwise, we need to decrease O.
The Perceptron Learning Method Since O = g( n j=0 W ji j ), we can change O by changing each W j. Assuming g is monotonic, to increase O we should increase W j if I j is positive, decrease W j if I j is negative. Similarly, to decrease O we should decrease W j if I j is positive, increase W j if I j is negative. This is done by updating each W j as follows W j W j + α I j (T O) where α is a positive constant, the learning rate.
Theoretic Background Learn by adjusting weights to reduce error on training set The squared error for an example with input x and true output y is E = 1 2 Err 2 1 2 (y h W (x))2, Perform optimization search by gradient descent: E = Err Err = Err n y g( W j x j ) W j W j W j = Err g (in) x j j = 0
Weight update rule W j W j α E W j = W j + α Err g (in) x j E.g., positive error = increasing network output = increasing weights on positive inputs and decreasing on negative inputs Simple weight update rule (assuming g (in) constant): W j W j + α Err x j
Learning a 5-place Minority Function At first, collect the data (see below), then choose a structure (a perceptron with five inputs and one output) and the activation function (i.e., step 3 ). Finally, set up parameters (i.e., W i = 0) and start to learn: Assumping α = 1, we have Sum = 5 i=1 W ii i, Out = step 3 (Sum), Err = T Out, and W j W j + I j Err. I 1 I 2 I 3 I 4 I 5 T W 1 W 2 W 3 W 4 W 5 Sum Out Err e 1 1 0 0 0 1 1 0 0 0 0 0 e 2 1 1 0 1 0 0 e 3 0 0 0 1 1 1 e 4 1 1 1 1 0 0 e 5 0 1 0 1 0 1 e 6 0 1 1 1 1 0 e 7 0 1 0 1 0 1 e 8 1 0 1 0 0 1
Learning a 5-place Minority Function The same as the last example, except that α = 0.5 instead of α = 1. Sum = 5 i=1 W ii i, Out = step 3 (Sum), Err = T Out, and W j W j + I j Err. I 1 I 2 I 3 I 4 I 5 T W 1 W 2 W 3 W 4 W 5 Sum Out Err e 1 1 0 0 0 1 1 0 0 0 0 0 e 2 1 1 0 1 0 0 e 3 0 0 0 1 1 1 e 4 1 1 1 1 0 0 e 5 0 1 0 1 0 1 e 6 0 1 1 1 1 0 e 7 0 1 0 1 0 1 e 8 1 0 1 0 0 1
Multilayer perceptrons Layers are usually fully connected; numbers of hidden units typically chosen by hand Output units a i W j,i Hidden units a j W k,j Input units a k
Expressiveness of MLPs All continuous functions w/ 2 layers, all functions w/ 3 layers h W (x 1, x 2 ) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0-4 -2 0 x 1 2 4-4 -2 0 2 4 x 2 h W (x 1, x 2 ) 0.9 1 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0-4 -2 0 x 1 2 4-4 -2 0 2 4 x 2
Back-propagation learning Output layer: same as for single-layer perceptron, W j,i W j,i + α a j i where i = Err i g (in i ) Hidden layer: back-propagate the error from the output layer: j = g (in j ) i W j,i i. Update rule for weights in hidden layer: W k,j W k,j + α a k j. (Most neuroscientists deny that back-propagation occurs in the brain)
Back-propagation derivation The squared error on a single example is defined as E = 1 2 (y i a i ) 2, where the sum is over the nodes in the output layer. i E W j,i = (y i a i ) a i W j,i = (y i a i ) g(in i) W j,i = (y i a i )g (in i ) in i W j,i = (y i a i )g (in i ) W j,i j W j,i a j = (y i a i )g (in i )a j = a j i
Back-propagation derivation E W k,j = i = i (y i a i ) a i W k,j = i (y i a i )g (in i ) in i W k,j = i (y i a i ) g(in i) W k,j i W k,j j W j,i = i a j i W j,i = W k,j i i W j,i g(in j ) W k,j = i = i i W j,i g (in j ) in j W k,j i W j,i g (in j ) W k,j ( k W k,j a k ) = i W j,i g (in j )a k = a k j
Decision Trees for Classification Examples described by attribute values (Boolean, discrete, continuous, etc.) E.g., situations where I will/won t wait for a table: Example Attributes Targe Bar F ri Hun P at P rice Rain Res T ype Est WillWa X 1 F F T Some $$$ F T French 0 10 T X 2 F F T Full $ F F Thai 30 60 F X 3 T F F Some $ F F Burger 0 10 T X 4 F T T Full $ F F Thai 10 30 T X 5 F T F Full $$$ F T French >60 F X 6 T F T Some $$ T T Italian 0 10 T X 7 T F F None $ T F Burger 0 10 F X 8 F F T Some $$ T T Thai 0 10 T X 9 T T F Full $ T F Burger >60 F X 10 T T T Full $$$ F T Italian 10 30 F X 11 F F F None $ F F Thai 0 10 F X 12 T T T Full $ F F Burger 30 60 T Classification of examples is positive (T) or negative (F)
Decision trees One possible representation for hypotheses. E.g., here is the true tree for deciding whether to wait: Patrons? None Some Full F T WaitEstimate? >60 30 60 10 30 0 10 F Alternate? Hungry? T No Yes No Yes Reservation? Fri/Sat? T Alternate? No Yes No Yes No Yes Bar? T F T T Raining? No Yes No Yes F T F T
Expressiveness Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row is a path from root to leaf: A B A xor B F F F F T T T F T T T F F F B F T A T T B T F T Trivially, a consistent decision tree for any training set w/ one path to leaf for each example (unless f nondeterministic in x) but it probably won t generalize to new examples Prefer to find more compact decision trees. F
Hypothesis spaces How many distinct decision trees with n Boolean attributes
Hypothesis spaces How many distinct decision trees with n Boolean attributes = number of Boolean functions
Hypothesis spaces How many distinct decision trees with n Boolean attributes = number of Boolean functions = number of distinct truth tables with 2 n rows = 2 2n
Hypothesis spaces How many distinct decision trees with n Boolean attributes = number of Boolean functions = number of distinct truth tables with 2 n rows = 2 2n E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees How many purely conjunctive hypotheses (e.g., Hungry Rain)
Hypothesis spaces How many distinct decision trees with n Boolean attributes = number of Boolean functions = number of distinct truth tables with 2 n rows = 2 2n E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees How many purely conjunctive hypotheses (e.g., Hungry Rain) Each attribute can be in (positive), in (negative), or out.
Hypothesis spaces How many distinct decision trees with n Boolean attributes = number of Boolean functions = number of distinct truth tables with 2 n rows = 2 2n E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees How many purely conjunctive hypotheses (e.g., Hungry Rain) Each attribute can be in (positive), in (negative), or out. 3 n distinct conjunctive hypotheses
Decision tree learning Aim: find a small tree consistent with the training examples Idea: (recursively) choose most significant attribute as root of (sub)tree Reason: A good attribute splits the examples into subsets that are (ideally) all positive or all negative Patrons? Type? None Some Full French Italian Thai Burger P atrons? is a better choice gives information about the classification
Information Information answers questions. The more clueless I am about the answer initially, the more information is contained in the answer. Scale: 1 = answer to Boolean question with prior 0.5, 0.5 0 = answer to Boolean question with known result. Information in an answer when prior is P 1,..., P n is I( P 1,..., P n ) = n i = 1 P i log 2 P i (also called entropy of the prior)
Information Suppose we have p positive and n negative examples at the root. Hence I( p/(p + n), n/(p + n) ) bits needed to classify a new example. E.g., for 12 restaurant examples and p = n = 6, we need 1 bit. Suppose an attribute A splits the examples E into subsets E i, each of which (we hope) needs less information to complete the classification. Let E i have p i positive and n i negative examples. Then I( p i /(p i + n i ), n i /(p i + n i ) ) bits needed to classify E i. The remaining bits after choosing A will be Remainder(A) = i p i + n i p + n I( p i/(p i + n i ), n i /(p i + n i ) ) Idea: Choose A such that Remainder(A) is minimal.
Information Patrons? Type? None Some Full French Italian Thai Burger Remainder(P atrons?) = 2 4 6 12I(0, 1) + 12I(1, 0) + 12 I( 2 6, 4 6 ) = 0.459 Remainder(T ype?) = 2 12 I( 1 2, 1 2 ) + 12 2 I( 1 2, 1 2 ) + 12 4 I( 2 4, 2 4 ) + 12 4 I( 2 4, 2 4 ) = 1. So we choose the attribute P atrons? that minimizes the remaining information.
Example contd. Decision tree learned from the 12 examples: Patrons? None Some Full F T Hungry? Yes No Type? F French Italian Thai Burger T F Fri/Sat? T No Yes F T Substantially simpler than true tree a more complex hypothesis isn t justified by small amount of data
Summary Learning needed for unknown environments, lazy designers Learning agent = performance element + learning element Learning method depends on type of performance element, available feedback, type of component to be improved, and its representation For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples Learning performance = prediction accuracy measured on test set Most brains have lots of neurons; each neuron linear threshold unit (?) Perceptrons (one-layer networks) insufficiently expressive Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation Many applications: speech, driving, handwriting, credit cards, etc.