Slide04 Haykin Chapter 4: Multi-Layer Perceptrons


CPSC
Instructor: Yoonsuck Choe
Spring 2008

Some materials from this lecture are from Mitchell (1997), Machine Learning, McGraw-Hill.

Introduction
- Networks typically consist of input, hidden, and output layers.
- Commonly referred to as multilayer perceptrons.
- The popular learning algorithm is the error backpropagation algorithm (backpropagation, or backprop, for short), which is a generalization of the LMS rule.
- Forward pass: activate the network, layer by layer.
- Backward pass: the error signal backpropagates from output to hidden and from hidden to input, and the weights are updated based on it.

Multilayer Perceptrons: Characteristics
- Each model neuron has a nonlinear activation function, typically a logistic function: $y_j = \frac{1}{1 + \exp(-v_j)}$
- The network contains one or more hidden layers (layers that are neither an input nor an output layer).
- The network exhibits a high degree of connectivity.

Multilayer Networks
- Differentiable threshold unit: sigmoid $\phi(v) = \frac{1}{1 + \exp(-v)}$
- Interesting property: $\frac{d\phi(v)}{dv} = \phi(v)(1 - \phi(v))$
- Output: $y = \phi(\mathbf{x}^T \mathbf{w})$
- Other functions: $\tanh(v) = \frac{1 - \exp(-2v)}{1 + \exp(-2v)}$
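The sigmoid property stated above, that the derivative equals $\phi(v)(1 - \phi(v))$, can be checked numerically. The following is a minimal sketch; the function names and the step size `h` are illustrative choices, not part of the lecture.

```python
import math


def sigmoid(v):
    """Logistic function: 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def sigmoid_prime(v):
    """Derivative written in terms of the output: phi(v) * (1 - phi(v))."""
    y = sigmoid(v)
    return y * (1.0 - y)


def tanh_from_exp(v):
    """tanh written with exponentials: (1 - exp(-2v)) / (1 + exp(-2v))."""
    e = math.exp(-2.0 * v)
    return (1.0 - e) / (1.0 + e)


def numeric_derivative(f, v, h=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(v + h) - f(v - h)) / (2.0 * h)
```

Comparing `sigmoid_prime(v)` with `numeric_derivative(sigmoid, v)` at a few points is a quick way to convince oneself of the identity before using it in the derivations that follow.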

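Multilayer Networks and Backpropagation
- Nonlinear decision surfaces.
[Figure: nonlinear decision regions for the words head, hid, who'd, hood, plotted over the inputs F1 and F2.]
[Figure: output as a function of two inputs. (a) One output: sigm(x+y-.). (b) Two hidden, one output: sigm(sigm(x+y-.)+sigm(-x-y+.3)-).]
- Another example: XOR.

Error Gradient for a Single Sigmoid Unit
For n input-output pairs $\{(\mathbf{x}_k, d_k)\}_{k=1}^{n}$:

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_k (d_k - y_k)^2 = \frac{1}{2} \sum_k \frac{\partial}{\partial w_i} (d_k - y_k)^2 = \frac{1}{2} \sum_k 2 (d_k - y_k) \frac{\partial}{\partial w_i} (d_k - y_k)$$

$$= \sum_k (d_k - y_k) \left( -\frac{\partial y_k}{\partial w_i} \right) = -\sum_k (d_k - y_k) \frac{\partial y_k}{\partial v_k} \frac{\partial v_k}{\partial w_i} \quad \text{(chain rule)}$$

Error Gradient for a Sigmoid Unit
From the previous page:

$$\frac{\partial E}{\partial w_i} = -\sum_k (d_k - y_k) \frac{\partial y_k}{\partial v_k} \frac{\partial v_k}{\partial w_i}$$

But we know:

$$\frac{\partial y_k}{\partial v_k} = \frac{\partial \phi(v_k)}{\partial v_k} = y_k (1 - y_k), \qquad \frac{\partial v_k}{\partial w_i} = \frac{\partial (\mathbf{x}_k^T \mathbf{w})}{\partial w_i} = x_{i,k}$$

So:

$$\frac{\partial E}{\partial w_i} = -\sum_k (d_k - y_k) \, y_k (1 - y_k) \, x_{i,k}$$

Backpropagation Algorithm
Initialize all weights to small random numbers. Until satisfied, do:
For each training example, do:
1. Input the training example to the network and compute the network outputs.
2. For each output unit j: $\delta_j \leftarrow y_j (1 - y_j)(d_j - y_j)$
3. For each hidden unit h: $\delta_h \leftarrow y_h (1 - y_h) \sum_{j \in outputs} w_{jh} \delta_j$
4. Update each network weight $w_{ji}$: $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \delta_j x_i$.
Note: $w_{ji}$ is the weight from i to j (i.e., $w_{j \leftarrow i}$).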
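The four steps of the backpropagation algorithm above can be sketched for a network with one hidden layer. This is an illustrative sketch, not the course's reference code: the bias handling (a constant input of 1 appended to each layer), the initialization range, and all function names are assumptions made here.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_network(n_in, n_hidden, n_out, rng):
    """Small random weights; each row feeds one unit, last entry is the bias weight (assumption)."""
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    return w_hidden, w_out


def forward(net, x):
    """Step 1: forward pass. Returns the bias-extended input, hidden, and output activations."""
    w_hidden, w_out = net
    xb = list(x) + [1.0]
    h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_hidden]
    hb = h + [1.0]
    y = [sigmoid(sum(w * hi for w, hi in zip(row, hb))) for row in w_out]
    return xb, hb, y


def train_example(net, x, d, eta):
    """Steps 2 to 4 for one training example; weights are updated in place."""
    w_hidden, w_out = net
    xb, hb, y = forward(net, x)
    # Step 2: delta for each output unit.
    delta_out = [yj * (1.0 - yj) * (dj - yj) for yj, dj in zip(y, d)]
    # Step 3: delta for each hidden unit, using the output weights before they change.
    delta_hid = [
        hb[h] * (1.0 - hb[h]) * sum(w_out[j][h] * delta_out[j] for j in range(len(w_out)))
        for h in range(len(w_hidden))
    ]
    # Step 4: w_ji <- w_ji + eta * delta_j * x_i.
    for j, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * delta_out[j] * hb[i]
    for h, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hid[h] * xb[i]


def total_error(net, data):
    """E = 1/2 * sum of squared output errors over all examples."""
    return sum(
        0.5 * sum((dj - yj) ** 2 for dj, yj in zip(d, forward(net, x)[2]))
        for x, d in data
    )
```

A typical use is to call `train_example` on every pair in the training set for many epochs and monitor `total_error`; whether a harder target such as XOR converges depends on the initial weights and the learning rate.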

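The δ Term
- For output unit: $\delta_j \leftarrow \underbrace{y_j (1 - y_j)}_{\phi'(v_j)} \underbrace{(d_j - y_j)}_{\text{Error}}$
- For hidden unit: $\delta_h \leftarrow \underbrace{y_h (1 - y_h)}_{\phi'(v_h)} \underbrace{\sum_{j \in outputs} w_{jh} \delta_j}_{\text{Backpropagated error}}$
- In sum, δ is the derivative times the error.
- Derivation to be presented later.

Derivation of Δw
Want to update weight as:

$$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}},$$

where error is defined as:

$$E(\mathbf{w}) \equiv \frac{1}{2} \sum_{j \in outputs} (d_j - y_j)^2$$

Given $v_j = \sum_i w_{ji} x_i$,

$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}}.$$

Different formula for output and hidden.

Derivation of Δw: Output Unit Weights
From the previous page, $\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}}$.

First, calculate $\frac{\partial E}{\partial v_j} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial v_j}$:

$$\frac{\partial E}{\partial y_j} = \frac{\partial}{\partial y_j} \frac{1}{2} \sum_{j \in outputs} (d_j - y_j)^2 = \frac{\partial}{\partial y_j} \frac{1}{2} (d_j - y_j)^2 = \frac{1}{2} \, 2 (d_j - y_j) \frac{\partial (d_j - y_j)}{\partial y_j} = -(d_j - y_j)$$

Next, calculate $\frac{\partial y_j}{\partial v_j}$: since $y_j = \phi(v_j)$ and $\phi'(v_j) = y_j (1 - y_j)$,

$$\frac{\partial y_j}{\partial v_j} = y_j (1 - y_j).$$

Putting everything together:

$$\frac{\partial E}{\partial v_j} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial v_j} = -(d_j - y_j) \, y_j (1 - y_j).$$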

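Derivation of Δw: Output Unit Weights
From the previous page:

$$\frac{\partial E}{\partial v_j} = -(d_j - y_j) \, y_j (1 - y_j).$$

Since $\frac{\partial v_j}{\partial w_{ji}} = \frac{\partial \sum_{i'} w_{ji'} x_{i'}}{\partial w_{ji}} = x_i$,

$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}} = -\underbrace{\underbrace{(d_j - y_j)}_{\text{error}} \underbrace{y_j (1 - y_j)}_{\phi'(net)}}_{\delta_j} \underbrace{x_i}_{\text{input}}$$

Derivation of Δw: Hidden Unit Weights
Start with $\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}} = \frac{\partial E}{\partial v_j} x_i$:

$$\frac{\partial E}{\partial v_j} = \sum_{k \in Downstream(j)} \frac{\partial E}{\partial v_k} \frac{\partial v_k}{\partial v_j} = \sum_{k \in Downstream(j)} -\delta_k \frac{\partial v_k}{\partial v_j} = \sum_{k \in Downstream(j)} -\delta_k \frac{\partial v_k}{\partial y_j} \frac{\partial y_j}{\partial v_j}$$

$$= \sum_{k \in Downstream(j)} -\delta_k w_{kj} \frac{\partial y_j}{\partial v_j} = \sum_{k \in Downstream(j)} -\delta_k w_{kj} \underbrace{y_j (1 - y_j)}_{\phi'(net)} \qquad (1)$$

Derivation of Δw: Hidden Unit Weights
Finally, given $\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial v_j} x_i$ and $\frac{\partial E}{\partial v_j} = \sum_{k \in Downstream(j)} -\delta_k w_{kj} \underbrace{y_j (1 - y_j)}_{\phi'(net)}$,

$$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = \eta \underbrace{\Big[ \underbrace{y_j (1 - y_j)}_{\phi'(net)} \underbrace{\sum_{k \in Downstream(j)} \delta_k w_{kj}}_{\text{error}} \Big]}_{\delta_j} x_i$$

Summary

$$\underbrace{\Delta w_{ji}(n)}_{\text{weight correction}} = \underbrace{\eta}_{\text{learning rate}} \cdot \underbrace{\delta_j(n)}_{\text{local gradient}} \cdot \underbrace{y_i(n)}_{\text{input signal}}$$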
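The derived gradients can be sanity-checked against finite differences. The sketch below assumes the smallest network that exercises both cases: one hidden sigmoid unit j feeding one output sigmoid unit k, with no bias terms. The function names and the step size are illustrative, not from the slides.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def error(w_hidden, w_out, x, d):
    """E = 1/2 (d - y_k)^2 for input x -> hidden unit j -> output unit k."""
    y_j = sigmoid(sum(w * xi for w, xi in zip(w_hidden, x)))
    y_k = sigmoid(w_out * y_j)
    return 0.5 * (d - y_k) ** 2


def analytic_gradients(w_hidden, w_out, x, d):
    """Gradients from the derivation: dE/dw = -delta * (input to that weight)."""
    y_j = sigmoid(sum(w * xi for w, xi in zip(w_hidden, x)))
    y_k = sigmoid(w_out * y_j)
    delta_k = y_k * (1.0 - y_k) * (d - y_k)        # output unit: phi' times error
    delta_j = y_j * (1.0 - y_j) * w_out * delta_k  # hidden unit: phi' times backpropagated error
    grad_hidden = [-delta_j * xi for xi in x]
    grad_out = -delta_k * y_j
    return grad_hidden, grad_out


def numeric_gradient_hidden(w_hidden, w_out, x, d, i, h=1e-6):
    """Central difference of E with respect to hidden weight i."""
    plus = list(w_hidden)
    minus = list(w_hidden)
    plus[i] += h
    minus[i] -= h
    return (error(plus, w_out, x, d) - error(minus, w_out, x, d)) / (2.0 * h)
```

Agreement between the two gradients at arbitrary weights is a common way to catch sign or index mistakes in a backpropagation implementation.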

Extension to Different Network Topologies
[Figure: output unit k, hidden unit j, and input unit i, connected by weights $w_{kj}$ and $w_{ji}$.]
- Arbitrary number of layers: for neurons in layer m: $\delta_r = y_r (1 - y_r) \sum_{s \in \text{layer } m+1} w_{sr} \delta_s$.
- Arbitrary acyclic graph: $\delta_r = y_r (1 - y_r) \sum_{s \in Downstream(r)} w_{sr} \delta_s$.

Backpropagation: Properties
- Gradient descent over the entire network weight vector.
- Easily generalized to arbitrary directed graphs.
- Will find a local, not necessarily global, error minimum. In practice, it often works well (can run multiple times with different initial weights).
- Minimizes error over training examples: will it generalize well to subsequent examples?
- Training can take thousands of iterations: slow!
- Using the network after training is very fast.

Learning Rate and Momentum
Tradeoffs regarding learning rate:
- Smaller learning rate: smoother trajectory but slower convergence.
- Larger learning rate: fast convergence, but can become unstable.
Momentum can help overcome the issues above:

$$\Delta w_{ji}(n) = \eta \delta_j(n) y_i(n) + \alpha \Delta w_{ji}(n-1).$$

The update rule can be written as:

$$\Delta w_{ji}(n) = \eta \sum_{t=0}^{n} \alpha^{n-t} \delta_j(t) y_i(t) = -\eta \sum_{t=0}^{n} \alpha^{n-t} \frac{\partial E(t)}{\partial w_{ji}(t)}.$$

Momentum (cont'd)

$$\Delta w_{ji}(n) = -\eta \sum_{t=0}^{n} \alpha^{n-t} \frac{\partial E(t)}{\partial w_{ji}(t)}$$

- The weight update is the sum of an exponentially weighted time series.
- Behavior:
  - When successive $\frac{\partial E(t)}{\partial w_{ji}(t)}$ take the same sign: the weight update is accelerated (speed up downhill).
  - When successive $\frac{\partial E(t)}{\partial w_{ji}(t)}$ have different signs: the weight update is damped (stabilize oscillation).
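The equivalence between the recursive momentum rule and the exponentially weighted sum can be illustrated for a single weight. This is a small sketch; `gradients[t]` stands for $\partial E(t) / \partial w_{ji}(t)$, and the function names and sample values are illustrative only.

```python
def momentum_updates(gradients, eta, alpha):
    """Recursive form: dw(n) = -eta * g(n) + alpha * dw(n - 1), with dw(-1) = 0."""
    updates = []
    prev = 0.0
    for g in gradients:
        prev = -eta * g + alpha * prev
        updates.append(prev)
    return updates


def momentum_closed_form(gradients, eta, alpha, n):
    """Closed form: dw(n) = -eta * sum_{t=0}^{n} alpha^(n - t) * g(t)."""
    return -eta * sum(alpha ** (n - t) * gradients[t] for t in range(n + 1))


# Gradients of constant sign: the update grows in magnitude (acceleration).
same_sign = momentum_updates([1.0, 1.0, 1.0, 1.0], eta=0.1, alpha=0.9)
# Gradients of alternating sign: successive terms largely cancel (damping).
alternating = momentum_updates([1.0, -1.0, 1.0, -1.0], eta=0.1, alpha=0.9)
```

With momentum constant `alpha` below 1 the weighted sum stays bounded, which is why the acceleration in the same-sign case levels off rather than growing without limit.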

Sequential (online) vs. Batch Training
Sequential mode:
- Update rule applied after each input-target presentation.
- Order of presentation should be randomized.
- Benefits: less storage; stochastic search through weight space helps avoid local minima.
- Disadvantages: hard to establish theoretical convergence conditions.
Batch mode:
- Update rule applied after all input-target pairs are seen.
- Benefits: accurate estimate of the gradient; convergence to a local minimum is guaranteed under simpler conditions.

Representational Power of Feedforward Networks
- Boolean functions: every Boolean function is representable with two layers (the number of hidden units can grow exponentially in the worst case: one hidden unit per input example, and OR them).
- Continuous functions: every bounded continuous function can be approximated with an arbitrarily small error (output units are linear).
- Arbitrary functions: with three layers (output units are linear).

Learning Hidden Layer Representations
[Figure: network mapping inputs to outputs through a hidden layer.]

Learned Hidden Layer Representations
[Table: Input, Hidden Values, Output.]
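The difference between the two training modes is only in when the weight change is applied. The sketch below shows one epoch of each for a single sigmoid unit. It assumes that the bias is supplied as a constant last input component; the function names are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def output(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def sequential_epoch(w, data, eta, rng):
    """Sequential mode: update after every example, in randomized order."""
    order = list(data)
    rng.shuffle(order)
    for x, d in order:
        y = output(w, x)
        delta = y * (1.0 - y) * (d - y)
        w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
    return w


def batch_epoch(w, data, eta):
    """Batch mode: accumulate over all examples, then apply one update."""
    total = [0.0] * len(w)
    for x, d in data:
        y = output(w, x)
        delta = y * (1.0 - y) * (d - y)
        total = [t + delta * xi for t, xi in zip(total, x)]
    return [wi + eta * t for wi, t in zip(w, total)]
```

Note that `batch_epoch` evaluates every example at the same weights, whereas `sequential_epoch` sees slightly different weights at each presentation, which is the source of its stochastic behavior.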

Learned Hidden Layer Representations
- Learned encoding is similar to standard 3-bit binary code.
- Automatic discovery of useful hidden layer representations is a key feature of ANN.
- Note: The hidden layer representation is compressed.

Overfitting
[Figure: Error versus weight updates (example 1) and (example 2). Curves: training set error, validation set error. Horizontal axis: number of weight updates. Vertical axis: error.]
- Error in two different robot perception tasks.
- Training set and validation set error.
- Early stopping ensures good performance on unobserved samples, but must be careful.
- Weight decay, use of validation sets, use of k-fold cross-validation, etc. to overcome the problem.

Recurrent Networks
[Figure: recurrent network with input, context, hidden, and output units and a delay.]
- Sequence recognition.
- Store tree structure (next slide).
- Can be trained with plain backpropagation.
- Generalization may not be perfect.

Recurrent Networks (Cont'd)
[Figure: network storing a stack; panels show input and stack contents such as A, (A, B), and (C, A, B), with a delay.]
- Autoassociation (input = output).
- Represent a stack using the hidden layer representation.
- Accuracy depends on numerical precision.
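The caution attached to early stopping can be made concrete with a small sketch. It assumes the validation error has been recorded after each round of weight updates. The `patience` parameter (how many rounds without improvement to tolerate) is a common device introduced here for illustration and is not part of the slides.

```python
def early_stopping_index(val_errors, patience):
    """Return the index of the best validation error seen before stopping.

    Training is considered stopped once `patience` consecutive checks
    have passed without a new minimum of the validation error.
    """
    best_idx = 0
    best = float("inf")
    for i, e in enumerate(val_errors):
        if e < best:
            best, best_idx = e, i
        elif i - best_idx >= patience:
            break
    return best_idx


# Validation error that rises briefly and then falls to a new minimum.
val_errors = [0.9, 0.7, 0.5, 0.55, 0.6, 0.4, 0.7]
stop_early = early_stopping_index(val_errors, patience=2)  # gives up at the first rise
stop_late = early_stopping_index(val_errors, patience=5)   # waits and finds the later minimum
```

Stopping at the first increase in validation error can miss a later, lower minimum, which is one reason the rule has to be applied with care.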

Some Applications: NETtalk
- NETtalk: Sejnowski and Rosenberg (1987).
- Learn to pronounce English text.
- Demo
- Data available in UCI ML repository

NETtalk data
[Table: sample entries (aardvark, aback, abaft, abandon, abase, abash, abate) with columns Word, Pronunciation, Stress/Syllable.]
- about 20,000 words

More Applications: Data Compression
- Construct an autoassociative memory where Input = Output.
- Train with small hidden layer.
- Encode using input-to-hidden weights.
- Send or store hidden layer activation.
- Decode received or stored hidden layer activation with the hidden-to-output weights.

Backpropagation Exercise
- URL:
- Untar and read the README file:
    gzip -dc backprop-.6.tar.gz | tar xvf -
- Run make to build (on departmental unix machines).
- Run ./bp conf/xor.conf etc.

Backpropagation: Example Results
[Figure: Backprop error versus epochs for OR, AND, and XOR.]
- Epoch: one full cycle of training through all training input patterns.
- OR was easiest, AND the next, and XOR was the most difficult to learn.
- Network had 2 input, 2 hidden and 1 output unit.
- Learning rate was ..

Backpropagation: Example Results (cont'd)
[Figure: Backprop error versus epochs for OR, AND, and XOR, with the network outputs for each function.]
- Output to (0,0), (0,1), (1,0), and (1,1) form each row.

Backpropagation: Things to Try
- How does increasing the number of hidden layer units affect the (1) time and the (2) number of epochs of training?
- How does increasing or decreasing the learning rate affect the rate of convergence?
- How does changing the slope of the sigmoid affect the rate of convergence?
- Different problem domains: handwriting recognition, etc.

MLP as a General Function Approximator
- MLP can be seen as performing nonlinear input-output mapping.
- Universal approximation theorem: Let $\phi(\cdot)$ be a nonconstant, bounded, monotone-increasing continuous function. Let $I_{m_0}$ denote the $m_0$-dimensional unit hypercube $[0, 1]^{m_0}$. The space of continuous functions on $I_{m_0}$ is denoted by $C(I_{m_0})$. Then given any function $f \in C(I_{m_0})$ and $\epsilon > 0$, there exists an integer $m_1$ and a set of real constants $\alpha_i$, $b_i$, and $w_{ij}$, where $i = 1, ..., m_1$ and $j = 1, ..., m_0$, such that we may define

$$F(x_1, ..., x_{m_0}) = \sum_{i=1}^{m_1} \alpha_i \, \phi\left( \sum_{j=1}^{m_0} w_{ij} x_j + b_i \right)$$

as an approximate realization of the function $f(\cdot)$; that is

$$|F(x_1, ..., x_{m_0}) - f(x_1, ..., x_{m_0})| < \epsilon$$

for all $x_1, ..., x_{m_0}$ that lie in the input space.
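The function $F$ in the universal approximation theorem is a single hidden layer of sigmoids followed by a linear output. A minimal sketch of evaluating it follows. The constants in the example are hand-picked for illustration (two hidden units forming a bump on the unit interval) and are not taken from the lecture.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def F(x, alpha, w, b):
    """F(x) = sum_i alpha_i * phi(sum_j w_ij * x_j + b_i), with phi the logistic function."""
    return sum(
        a_i * sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) + b_i)
        for a_i, w_i, b_i in zip(alpha, w, b)
    )


# Hypothetical constants: m_0 = 1 input, m_1 = 2 hidden units.
# F(x) = phi(20x - 5) - phi(20x - 15), close to 1 near x = 0.5 and close to 0 near x = 0 and x = 1.
alpha = [1.0, -1.0]
w = [[20.0], [20.0]]
b = [-5.0, -15.0]
```

Adding more hidden units with different offsets lets sums of such bumps follow a target function more closely, which gives some intuition for why a sufficiently large $m_1$ exists; the theorem itself does not say how to find the constants.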

MLP as a General Function Approximator (cont'd)
- The universal approximation theorem is an existence theorem, and it merely generalizes approximations by finite Fourier series.
- The universal approximation theorem is directly applicable to neural networks (MLP), and it implies that one hidden layer is sufficient.
- The theorem does not say that a single hidden layer is optimum in terms of learning time, generalization, etc.

Generalization
- A network is said to generalize well when the input-output mapping computed by the network is correct (or nearly so) for test data never used during training.
- This definition is apt when learning is viewed as curve fitting.
- Issues: overfitting or overtraining, due to memorization. Smoothness in the mapping is desired, and this is related to criteria like Occam's razor.

Generalization and Training Set Size
- Generalization is influenced by three factors:
  - Size of the training set, and how representative it is.
  - The architecture of the network.
  - Physical complexity of the problem.
- Sample complexity and VC dimension are related. In practice, $N = O\left(\frac{W}{\epsilon}\right)$, where W is the total number of free parameters, and $\epsilon$ is the error tolerance.

Training Set Size and Curse of Dimensionality
[Figure: 1D: 4 inputs; 2D: 16 inputs; 3D: 64 inputs.]
- As the dimensionality of the input grows, exponentially more inputs are needed to maintain the same density in unit space.
- In other words, the sampling density of N inputs in m-dimensional space is proportional to $N^{1/m}$.
- One way to overcome this is to use prior knowledge about the function.
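The counts in the curse-of-dimensionality figure follow from keeping the same number of samples along every axis. A short sketch of that arithmetic, with illustrative function names:

```python
def points_needed(per_axis, m):
    """Samples required to keep `per_axis` samples along each of m dimensions."""
    return per_axis ** m


def per_axis_density(n, m):
    """Samples per axis that n points provide in m dimensions: n^(1/m)."""
    return n ** (1.0 / m)


# 4 samples per axis in 1, 2, and 3 dimensions.
counts = [points_needed(4, m) for m in (1, 2, 3)]
```

Read in the other direction, a fixed budget of N samples thins out quickly: `per_axis_density(64, 3)` is about 4, while the same 64 samples in 6 dimensions give only about 2 per axis.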

Cross-Validation
Use of validation set (not used during training, used for measuring generalizability).
- Model selection
- Early stopping
- Hold-out method: multiple cross-validation, leave-one-out method, etc.

Virtues and Limitations of Backprop
- Connectionism: biological metaphor, local computation, graceful degradation, parallelism. (Some limitations exist regarding the biological plausibility of backprop.)
- Feature detection: hidden neurons perform feature detection.
- Function approximation: a form of nested sigmoid.
- Computational complexity: computation is polynomial in the number of adjustable parameters, thus it can be said to be efficient.
- Sensitivity analysis: sensitivity $S^F_\omega = \frac{\partial F / F}{\partial \omega / \omega}$ can be estimated efficiently.
- Robustness: disturbances can only cause small estimation errors.
- Convergence: stochastic approximation, and it can be slow.
- Local minima and scaling issues

Heuristic for Accelerating Convergence
Learning rate adaptation:
- Separate learning rate for each tunable weight.
- Each learning rate is allowed to adjust after each iteration.
- If the derivative of the cost function has the same sign for several iterations, increase the learning rate.
- If the derivative of the cost function alternates the sign over several iterations, decrease the learning rate.

Summary
- Backprop for MLP is local and efficient (in calculating the partial derivative).
- Backprop can handle nonlinear mappings.
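The learning rate adaptation heuristic can be sketched as a per-weight rule. This is a simplified illustration: it compares only the current and previous derivative rather than "several iterations", and the multiplicative factors `up` and `down` are arbitrary values chosen here, not values from the lecture.

```python
def adapt_learning_rates(rates, grads, prev_grads, up=1.1, down=0.5):
    """Adjust one learning rate per weight from the sign pattern of its derivative.

    Same sign as the previous iteration: increase the rate.
    Opposite sign: decrease the rate.
    Either derivative zero: leave the rate unchanged.
    """
    new_rates = []
    for r, g, pg in zip(rates, grads, prev_grads):
        if g * pg > 0:
            new_rates.append(r * up)
        elif g * pg < 0:
            new_rates.append(r * down)
        else:
            new_rates.append(r)
    return new_rates


def apply_updates(weights, rates, grads):
    """Gradient descent step with a separate learning rate for each weight."""
    return [w - r * g for w, r, g in zip(weights, rates, grads)]
```

A training loop would call `adapt_learning_rates` once per iteration, keep the current derivatives as `prev_grads` for the next call, and then apply the step with `apply_updates`.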
