Slide04 Haykin Chapter 4: Multi-Layer Perceptrons

CPSC 636-600, Instructor: Yoonsuck Choe, Spring 2008

Introduction

- Networks typically consist of input, hidden, and output layers. Commonly referred to as multilayer perceptrons (MLPs).
- The most popular learning algorithm is the error backpropagation algorithm (backpropagation, or backprop, for short), which is a generalization of the LMS rule.
- Forward pass: activate the network, layer by layer.
- Backward pass: the error signal backpropagates from output to hidden and from hidden to input, and the weights are updated based on it.
- Some materials in this lecture are from Mitchell (1997), Machine Learning, McGraw-Hill.

Multilayer Perceptrons: Characteristics

- Each model neuron has a nonlinear activation function, typically the logistic function: y_j = 1 / (1 + exp(−v_j)).
- The network contains one or more hidden layers (layers that are neither input nor output layers).
- The network exhibits a high degree of connectivity.

Multilayer Networks

- Differentiable threshold unit: the sigmoid φ(v) = 1 / (1 + exp(−v)), with the useful property dφ(v)/dv = φ(v)(1 − φ(v)).
- Output: y = φ(x^T w).
- Other activation functions: tanh(v) = (1 − exp(−2v)) / (1 + exp(−2v)).
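As a quick illustration of these activation functions, here is a minimal Python sketch (the input values are made up for the example) of the logistic sigmoid, its derivative property φ'(v) = φ(v)(1 − φ(v)), and tanh:

```python
import numpy as np

def sigmoid(v):
    """Logistic function: phi(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_deriv(v):
    """Derivative via the property d phi/dv = phi(v) * (1 - phi(v))."""
    y = sigmoid(v)
    return y * (1.0 - y)

def tanh_act(v):
    """tanh(v) = (1 - exp(-2v)) / (1 + exp(-2v)), an alternative activation."""
    return (1.0 - np.exp(-2.0 * v)) / (1.0 + np.exp(-2.0 * v))

v = np.array([-2.0, 0.0, 2.0])      # example activations (made up)
print(sigmoid(v), sigmoid_deriv(v), tanh_act(v))
```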

Multilayer Networks and Backpropagation

- Example: vowel recognition ("head", "hid", "who'd", "hood") from two formant frequencies F1 and F2; the hidden layer lets the network form nonlinear decision surfaces.
- [Figure: nonlinear decision surfaces formed by sigmoid units over two inputs: (a) a single output unit; (b) two hidden units feeding one output unit.]
- Another example: XOR.

Error Gradient for a Single Sigmoid Unit

For n input-output pairs {(x_k, d_k)}, k = 1, ..., n, with E = (1/2) Σ_k (d_k − y_k)^2:

∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_k (d_k − y_k)^2 ]
        = (1/2) Σ_k 2 (d_k − y_k) ∂/∂w_i (d_k − y_k)
        = Σ_k (d_k − y_k) ( −∂y_k/∂w_i )
        = −Σ_k (d_k − y_k) (∂y_k/∂v_k)(∂v_k/∂w_i)    (chain rule)

Error Gradient for a Sigmoid Unit (cont'd)

From the previous page:
∂E/∂w_i = −Σ_k (d_k − y_k) (∂y_k/∂v_k)(∂v_k/∂w_i)

But we know:
∂y_k/∂v_k = ∂φ(v_k)/∂v_k = y_k (1 − y_k),  and  ∂v_k/∂w_i = ∂(x_k^T w)/∂w_i = x_{i,k}.

So:
∂E/∂w_i = −Σ_k (d_k − y_k) y_k (1 − y_k) x_{i,k}.

Backpropagation Algorithm

Initialize all weights to small random numbers. Until satisfied, do:
For each training example, do:
1. Input the training example to the network and compute the network outputs.
2. For each output unit j: δ_j = y_j (1 − y_j)(d_j − y_j).
3. For each hidden unit h: δ_h = y_h (1 − y_h) Σ_{j ∈ outputs} w_{jh} δ_j.
4. Update each network weight w_{ji}: w_{ji} ← w_{ji} + Δw_{ji}, where Δw_{ji} = η δ_j x_i.
Note: w_{ji} is the weight from unit i to unit j (i.e., w_{j←i}).
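To make steps 1-4 concrete, here is a minimal Python sketch of a single stochastic backprop update for a one-hidden-layer network of sigmoid units; the layer sizes, learning rate, and training pair are illustrative assumptions, not part of the original exercise code:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, d, W_hid, W_out, eta=0.5):
    """One weight update following the slide's steps 1-4.

    W_hid: hidden-layer weights, shape (n_hidden, n_inputs)  -> w_{hi}
    W_out: output-layer weights, shape (n_outputs, n_hidden) -> w_{jh}
    """
    # Step 1: forward pass, layer by layer.
    y_hid = sigmoid(W_hid @ x)            # hidden activations y_h
    y_out = sigmoid(W_out @ y_hid)        # output activations y_j

    # Step 2: output-unit deltas  delta_j = y_j (1 - y_j)(d_j - y_j).
    delta_out = y_out * (1.0 - y_out) * (d - y_out)

    # Step 3: hidden-unit deltas  delta_h = y_h (1 - y_h) sum_j w_jh delta_j.
    delta_hid = y_hid * (1.0 - y_hid) * (W_out.T @ delta_out)

    # Step 4: weight updates  Delta w = eta * delta * (input to that layer).
    W_out += eta * np.outer(delta_out, y_hid)
    W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out

# Tiny illustrative run (sizes and data are assumptions): 2 inputs, 2 hidden, 1 output.
rng = np.random.default_rng(0)
W_hid = rng.uniform(-0.5, 0.5, size=(2, 2))
W_out = rng.uniform(-0.5, 0.5, size=(1, 2))
W_hid, W_out = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W_hid, W_out)
```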

The δ Term

- For an output unit:  δ_j = y_j (1 − y_j) · (d_j − y_j)   [φ'(v_j) × error]
- For a hidden unit:   δ_h = y_h (1 − y_h) · Σ_{j ∈ outputs} w_{jh} δ_j   [φ'(v_h) × backpropagated error]
- In sum, δ is the derivative times the error. The derivation is presented next.

Derivation of Δw

Want to update the weight as Δw_{ji} = −η ∂E/∂w_{ji}, where the error is defined as
E(w) = (1/2) Σ_{j ∈ outputs} (d_j − y_j)^2.
Given v_j = Σ_i w_{ji} x_i,
∂E/∂w_{ji} = (∂E/∂v_j)(∂v_j/∂w_{ji}).
The formula differs for output and hidden units.

Derivation of Δw: Output Unit Weights

From the previous page, ∂E/∂w_{ji} = (∂E/∂y_j)(∂y_j/∂v_j)(∂v_j/∂w_{ji}). First, calculate ∂E/∂y_j:

∂E/∂y_j = ∂/∂y_j [ (1/2) Σ_{j ∈ outputs} (d_j − y_j)^2 ]
        = (1/2) · 2 (d_j − y_j) ∂/∂y_j (d_j − y_j)
        = −(d_j − y_j).

Derivation of Δw: Output Unit Weights (cont'd)

Next, calculate ∂y_j/∂v_j: since y_j = φ(v_j) and φ'(v_j) = y_j (1 − y_j),
∂y_j/∂v_j = y_j (1 − y_j).
Putting everything together:
∂E/∂v_j = (∂E/∂y_j)(∂y_j/∂v_j) = −(d_j − y_j) y_j (1 − y_j).
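A quick worked example of the δ term (the numbers are made up for illustration): an output unit with y_j = 0.7 and target d_j = 1.0, and a hidden unit with y_h = 0.4 feeding that output through w_{jh} = 0.5, give

```latex
\delta_j = y_j(1-y_j)(d_j-y_j) = 0.7 \times 0.3 \times 0.3 = 0.063,
\qquad
\delta_h = y_h(1-y_h)\, w_{jh}\, \delta_j
         = 0.4 \times 0.6 \times 0.5 \times 0.063 \approx 0.0076 .
```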

Derivation of Δw: Output Unit Weights (cont'd)

From the previous page, ∂E/∂v_j = −(d_j − y_j) y_j (1 − y_j). Since v_j = Σ_i w_{ji} x_i, we have ∂v_j/∂w_{ji} = x_i, so

∂E/∂w_{ji} = −(d_j − y_j) y_j (1 − y_j) x_i = −[error × φ'(net)] × input = −δ_j x_i.

Derivation of Δw: Hidden Unit Weights

Start with ∂E/∂v_j (j now a hidden unit, k ranging over the output units it feeds):

∂E/∂v_j = Σ_k (∂E/∂v_k)(∂v_k/∂v_j)
        = Σ_k (−δ_k)(∂v_k/∂y_j)(∂y_j/∂v_j)
        = Σ_k (−δ_k) w_{kj} y_j (1 − y_j)
        = −y_j (1 − y_j) Σ_k δ_k w_{kj}.    (1)

Derivation of Δw: Hidden Unit Weights (cont'd)

Finally, given ∂E/∂v_j = −y_j (1 − y_j) Σ_k δ_k w_{kj} from (1) and ∂v_j/∂w_{ji} = x_i,

Δw_{ji} = −η ∂E/∂w_{ji} = η [ y_j (1 − y_j) Σ_k δ_k w_{kj} ] x_i = η δ_j x_i,

where the bracketed term is φ'(net) times the backpropagated error, i.e., δ_j.

Summary

Δw_{ji}(n) = η · δ_j(n) · y_i(n)
(weight correction = learning rate × local gradient × input signal)
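As a sanity check on the derived formula ∂E/∂w_{ji} = −δ_j x_i, the sketch below (assumed layer sizes and data; the finite-difference comparison is my addition, not part of the slides) compares the analytic gradient for one output weight with a numerical estimate:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def loss(W_hid, W_out, x, d):
    """E(w) = 1/2 * sum_j (d_j - y_j)^2 for a two-layer sigmoid network."""
    y_hid = sigmoid(W_hid @ x)
    y_out = sigmoid(W_out @ y_hid)
    return 0.5 * np.sum((d - y_out) ** 2)

rng = np.random.default_rng(1)
x, d = np.array([1.0, 0.0]), np.array([1.0])
W_hid = rng.uniform(-0.5, 0.5, (2, 2))
W_out = rng.uniform(-0.5, 0.5, (1, 2))

# Analytic gradient for one output weight w_{jh} (j=0, h=0): dE/dw = -delta_j * y_h.
y_hid = sigmoid(W_hid @ x)
y_out = sigmoid(W_out @ y_hid)
delta_j = y_out[0] * (1 - y_out[0]) * (d[0] - y_out[0])
analytic = -delta_j * y_hid[0]

# Numerical gradient by central finite differences.
eps = 1e-6
Wp, Wm = W_out.copy(), W_out.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numerical = (loss(W_hid, Wp, x, d) - loss(W_hid, Wm, x, d)) / (2 * eps)

print(analytic, numerical)   # the two values should agree closely
```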

Extension to Different Network Topologies

[Figure: output, hidden, and input layers, with weights w_{kj} (hidden to output) and w_{ji} (input to hidden).]

- Arbitrary number of layers: for neurons in layer m,
  δ_r = y_r (1 − y_r) Σ_{s ∈ layer m+1} w_{sr} δ_s.
- Arbitrary acyclic graph:
  δ_r = y_r (1 − y_r) Σ_{s ∈ Downstream(r)} w_{sr} δ_s.

Backpropagation: Properties

- Gradient descent over the entire network weight vector.
- Easily generalized to arbitrary directed graphs.
- Will find a local, not necessarily global, error minimum. In practice it often works well (can run multiple times with different initial weights).
- Minimizes error over the training examples: will it generalize well to subsequent examples?
- Training can take thousands of iterations, i.e., it is slow. Using the network after training is very fast.

Learning Rate and Momentum

Tradeoffs regarding the learning rate:
- Smaller learning rate: smoother trajectory but slower convergence.
- Larger learning rate: fast convergence, but can become unstable.

Momentum can help overcome the issues above:
Δw_{ji}(n) = η δ_j(n) y_i(n) + α Δw_{ji}(n − 1).

The update rule can be written as:
Δw_{ji}(n) = η Σ_{t=0}^{n} α^{n−t} δ_j(t) y_i(t) = −η Σ_{t=0}^{n} α^{n−t} ∂E(t)/∂w_{ji}(t).

Momentum (cont'd)

Δw_{ji}(n) = −η Σ_{t=0}^{n} α^{n−t} ∂E(t)/∂w_{ji}(t)

- The weight change is the sum of an exponentially weighted time series of gradients.
- Behavior: when successive ∂E(t)/∂w_{ji}(t) have the same sign, the weight update is accelerated (speed up downhill); when successive gradients alternate in sign, the weight update is damped (stabilizing oscillation).
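A minimal sketch of the momentum update, written in the gradient form Δw(n) = −η ∂E/∂w(n) + α Δw(n − 1); the learning rate, momentum constant, and gradient values are illustrative assumptions:

```python
import numpy as np

def momentum_update(grad, prev_dw, eta=0.1, alpha=0.9):
    """Delta w(n) = -eta * dE/dw(n) + alpha * Delta w(n-1).

    grad    : current gradient dE/dw (array)
    prev_dw : previous weight change Delta w(n-1)
    """
    return -eta * grad + alpha * prev_dw

# Consistent gradient signs -> updates accumulate (accelerate downhill);
# alternating signs -> updates partially cancel (damped oscillation).
dw = np.zeros(3)
for grad in [np.array([0.2, -0.1, 0.05])] * 5:   # made-up, constant gradient
    dw = momentum_update(grad, dw)
    print(dw)
```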

Sequential (Online) vs. Batch Training

Sequential mode:
- The update rule is applied after each input-target presentation.
- The order of presentation should be randomized.
- Benefits: less storage; the stochastic search through weight space helps avoid local minima.
- Disadvantages: hard to establish theoretical convergence conditions.

Batch mode:
- The update rule is applied after all input-target pairs have been seen.
- Benefits: accurate estimate of the gradient; convergence to a local minimum is guaranteed under simpler conditions.

Representational Power of Feedforward Networks

- Boolean functions: every boolean function is representable with two layers (the number of hidden units can grow exponentially in the worst case: one hidden unit per input example, then OR them).
- Continuous functions: every bounded continuous function can be approximated with arbitrarily small error with two layers (output units linear).
- Arbitrary functions: with three layers (output units linear).

Learning Hidden Layer Representations

[Figure: an 8 × 3 × 8 network trained to reproduce its input at the output (identity mapping over the eight one-hot input patterns).]

Learned Hidden Layer Representations

Input        Hidden Values      Output
10000000     .89  .04  .08      10000000
01000000     .01  .11  .88      01000000
00100000     .01  .97  .27      00100000
00010000     .99  .97  .71      00010000
00001000     .03  .05  .02      00001000
00000100     .22  .99  .99      00000100
00000010     .80  .01  .98      00000010
00000001     .60  .94  .01      00000001
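The 8 × 3 × 8 identity-mapping experiment is easy to reproduce; the sketch below trains in sequential (online) mode with assumed hyperparameters (learning rate, number of epochs, bias units), so the hidden codes it finds will differ in detail from the table above:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
X = np.eye(8)                                    # eight one-hot patterns; target = input
n_in, n_hid, n_out = 8, 3, 8
# Treat the bias as an extra input/hidden unit clamped to 1.
W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in + 1))   # input (+bias) -> hidden
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid + 1))  # hidden (+bias) -> output
eta = 0.3                                        # assumed learning rate

def forward(x):
    h = sigmoid(W1 @ np.append(x, 1.0))
    y = sigmoid(W2 @ np.append(h, 1.0))
    return h, y

for epoch in range(5000):                        # assumed number of epochs
    for x in rng.permutation(X):                 # sequential (online) mode, random order
        h, y = forward(x)
        delta_out = y * (1 - y) * (x - y)        # target d = input x (autoassociation)
        delta_hid = h * (1 - h) * (W2[:, :n_hid].T @ delta_out)
        W2 += eta * np.outer(delta_out, np.append(h, 1.0))
        W1 += eta * np.outer(delta_hid, np.append(x, 1.0))

# Learned (compressed) hidden codes, one row per input pattern.
print(np.round(np.array([forward(x)[0] for x in X]), 2))
```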

Learned Hidden Layer Representations (cont'd)

- The learned encoding is similar to the standard 3-bit binary code.
- Automatic discovery of useful hidden layer representations is a key feature of ANNs.
- Note: the hidden layer representation is compressed.

Overfitting

[Figure: "Error versus weight updates" for two different robot perception tasks, plotting training set error and validation set error against the number of weight updates; the training error keeps decreasing while the validation error eventually rises.]

- Early stopping ensures good performance on unobserved samples, but one must be careful (the validation error can dip and rise more than once).
- Weight decay, use of validation sets, use of k-fold cross-validation, etc., can be used to overcome the problem.

Recurrent Networks

[Figure: a feedforward network (input, hidden, output) augmented with a delay loop that feeds activations back as context input.]

- Sequence recognition.
- Store a tree structure (next slide).
- Can be trained with plain backpropagation.
- Generalization may not be perfect.

Recurrent Networks (cont'd)

[Figure: pushing symbols A, B, C onto a stack: input plus current stack → new stack, e.g. A → (A, B) → (C, A, B), with the delayed hidden representation serving as the stack.]

- Autoassociation (input = output).
- Represent a stack using the hidden layer representation.
- Accuracy depends on numerical precision.
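A minimal early-stopping skeleton in the spirit of the validation-set curves above; train_epoch and evaluate are hypothetical helpers, and the patience value is an assumption:

```python
import copy

def train_with_early_stopping(model, train_set, val_set,
                              train_epoch, evaluate,
                              max_epochs=1000, patience=10):
    """Stop when validation error has not improved for `patience` epochs.

    train_epoch(model, data) -> None   : one pass of weight updates (hypothetical helper)
    evaluate(model, data)    -> float  : mean error on a dataset (hypothetical helper)
    """
    best_err, best_model, bad_epochs = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_epoch(model, train_set)
        val_err = evaluate(model, val_set)
        if val_err < best_err:
            best_err, best_model, bad_epochs = val_err, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # careful: validation error can dip again later
                break
    return best_model
```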

Some Applications: NETtalk

- NETtalk: Sejnowski and Rosenberg (1987). Learn to pronounce English text.
- Demo; data available in the UCI ML repository.

NETtalk Data

[Sample dictionary entries, one word per line with three columns (word, phonemic pronunciation, stress/syllable markers), e.g. aardvark, aback, abacus, abaft, abalone, abandon, abase, abash, abate, abatis, ...; about 20,000 words in total.]

More Applications: Data Compression

- Construct an autoassociative memory where Input = Output.
- Train with a small hidden layer.
- Encode using the input-to-hidden weights; send or store the hidden layer activation.
- Decode the received or stored hidden layer activation with the hidden-to-output weights.

Backpropagation Exercise

- URL: http://www.cs.tamu.edu/faculty/choe/src/backprop-1.6.tar.gz
- Untar and read the README file: gzip -dc backprop-1.6.tar.gz | tar xvf -
- Run make to build (on departmental unix machines).
- Run ./bp conf/xor.conf etc.
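The compression scheme amounts to splitting a trained autoassociative network into an encoder (input-to-hidden weights) and a decoder (hidden-to-output weights); a sketch under that assumption, with untrained random weights standing in for a trained network:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def encode(x, W_in):
    """Sender side: compress the input to the (smaller) hidden activation."""
    return sigmoid(W_in @ x)

def decode(h, W_out):
    """Receiver side: reconstruct the input from the stored/transmitted hidden code."""
    return sigmoid(W_out @ h)

# Illustrative shapes only: 8-dimensional data squeezed through 3 hidden units.
rng = np.random.default_rng(0)
W_in, W_out = rng.standard_normal((3, 8)), rng.standard_normal((8, 3))
x = np.eye(8)[2]                 # example pattern (made up; weights here are untrained)
x_hat = decode(encode(x, W_in), W_out)
print(x_hat)
```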

Backpropagation: Example Results

[Figure: error vs. number of epochs for the backprop exercise on XOR, AND, and OR.]

- Epoch: one full cycle of training through all training input patterns.
- OR was the easiest, AND the next, and XOR was the most difficult to learn.
- The network had 2 input, 2 hidden, and 1 output unit. The learning rate was 0.1.

Backpropagation: Example Results (cont'd)

[Figure: the same error curves, plus the network outputs for XOR, AND, and OR; the outputs for inputs (0,0), (0,1), (1,0), and (1,1) form each row.]

Backpropagation: Things to Try

- How does increasing the number of hidden layer units affect (1) the time and (2) the number of epochs of training?
- How does increasing or decreasing the learning rate affect the rate of convergence?
- How does changing the slope of the sigmoid affect the rate of convergence?
- Different problem domains: handwriting recognition, etc.

MLP as a General Function Approximator

An MLP can be seen as performing a nonlinear input-output mapping.

Universal approximation theorem: Let φ(·) be a nonconstant, bounded, monotone-increasing continuous function. Let I_{m_0} denote the m_0-dimensional unit hypercube [0, 1]^{m_0}. The space of continuous functions on I_{m_0} is denoted by C(I_{m_0}). Then, given any function f ∈ C(I_{m_0}) and ε > 0, there exist an integer m_1 and sets of real constants α_i, b_i, and w_{ij}, where i = 1, ..., m_1 and j = 1, ..., m_0, such that we may define

F(x_1, ..., x_{m_0}) = Σ_{i=1}^{m_1} α_i φ( Σ_{j=1}^{m_0} w_{ij} x_j + b_i )

as an approximate realization of the function f(·); that is,

|F(x_1, ..., x_{m_0}) − f(x_1, ..., x_{m_0})| < ε

for all x_1, ..., x_{m_0} that lie in the input space.
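To connect the theorem's formula to code, here is a direct evaluation of F(x) = Σ_i α_i φ(Σ_j w_{ij} x_j + b_i) for one hidden layer; the particular constants are arbitrary illustrations, not ones chosen to approximate any specific f:

```python
import numpy as np

def phi(v):
    """A nonconstant, bounded, monotone-increasing continuous function (logistic sigmoid)."""
    return 1.0 / (1.0 + np.exp(-v))

def F(x, alpha, W, b):
    """F(x_1..x_m0) = sum_i alpha_i * phi(sum_j w_ij x_j + b_i)."""
    return alpha @ phi(W @ x + b)

# Example with m0 = 2 inputs and m1 = 4 hidden terms (constants chosen arbitrarily).
alpha = np.array([0.5, -1.2, 0.8, 0.3])
W = np.array([[ 1.0, -2.0],
              [ 0.5,  0.5],
              [-1.0,  1.5],
              [ 2.0,  0.0]])
b = np.array([0.1, -0.3, 0.0, 0.7])
print(F(np.array([0.2, 0.9]), alpha, W, b))
```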

MLP as a General Function Approximator (cont'd)

- The universal approximation theorem is an existence theorem, and it merely generalizes approximations by finite Fourier series.
- The theorem is directly applicable to neural networks (MLPs), and it implies that one hidden layer is sufficient.
- The theorem does not say that a single hidden layer is optimum in terms of learning time, generalization, etc.

Generalization

- A network is said to generalize well when the input-output mapping computed by the network is correct (or nearly so) for test data never used during training. This view is apt when we take the curve-fitting view.
- Issues: overfitting or overtraining, due to memorization.
- Smoothness in the mapping is desired, and this is related to criteria like Occam's razor.

Generalization and Training Set Size

Generalization is influenced by three factors:
- the size of the training set, and how representative it is;
- the architecture of the network;
- the physical complexity of the problem.

Sample complexity and the VC dimension are related. In practice,

N = O(W / ε),

where W is the total number of free parameters and ε is the error tolerance.

Training Set Size and the Curse of Dimensionality

[Figure: sampling a unit interval, square, and cube at the same resolution. 1D: 4 inputs; 2D: 16 inputs; 3D: 64 inputs.]

- As the dimensionality of the input grows, exponentially more inputs are needed to maintain the same density in the unit space. In other words, the sampling density of N inputs in m-dimensional space is proportional to N^{1/m}.
- One way to overcome this is to use prior knowledge about the function.
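A worked illustration of the rule of thumb, with assumed numbers (W = 100 free parameters, error tolerance ε = 0.1), plus the sampling-density point from the figure:

```latex
% Assumed numbers for illustration: W = 100 free parameters, \epsilon = 0.1.
N = O\!\left(\frac{W}{\epsilon}\right) \;\approx\; \frac{100}{0.1} \;=\; 1000
\ \text{training examples.}
% Curse of dimensionality: keeping 4 samples per input axis requires 4^m examples,
% i.e. 4, 16, 64, 256, \dots as the input dimension m grows.
```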

Cross-Validation

- Use of a validation set (not used during training, used for measuring generalizability).
- Model selection.
- Early stopping.
- Hold-out method: multiple cross-validation, leave-one-out method, etc.

Virtues and Limitations of Backprop

- Connectionism: biological metaphor, local computation, graceful degradation, parallelism. (Some limitations exist regarding the biological plausibility of backprop.)
- Feature detection: hidden neurons perform feature detection.
- Function approximation: a form of nested sigmoids.
- Computational complexity: computation is polynomial in the number of adjustable parameters, thus it can be said to be efficient.
- Sensitivity analysis: the sensitivity S^F_ω = (∂F/F) / (∂ω/ω) can be estimated efficiently.
- Robustness: disturbances can only cause small estimation errors.
- Convergence: a form of stochastic approximation, and it can be slow.
- Local minima and scaling issues.

Heuristics for Accelerating Convergence

Learning rate adaptation:
- Use a separate learning rate for each tunable weight.
- Each learning rate is allowed to adjust after each iteration.
- If the derivative of the cost function has the same sign for several iterations, increase that learning rate.
- If the derivative of the cost function alternates in sign over several iterations, decrease that learning rate.

Summary

- Backprop for MLPs is local and efficient (in calculating the partial derivatives).
- Backprop can handle nonlinear mappings.
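A sketch of the sign-based learning-rate adaptation heuristic with per-weight rates; the increase/decrease factors and initial rate are assumptions, and this simplified version compares only consecutive gradients rather than tracking several iterations:

```python
import numpy as np

def adapt_learning_rates(grad, prev_grad, rates, up=1.1, down=0.5):
    """Per-weight learning-rate adaptation based on gradient sign agreement.

    If a weight's derivative keeps the same sign as in the previous iteration,
    grow that weight's rate; if the sign flips (oscillation), shrink it.
    """
    same_sign = np.sign(grad) == np.sign(prev_grad)
    return np.where(same_sign, rates * up, rates * down)

# Example usage with made-up gradients for three weights.
rates = np.full(3, 0.1)
prev_grad = np.array([0.2, -0.3, 0.1])
grad = np.array([0.15, 0.25, 0.05])      # second component flipped sign
rates = adapt_learning_rates(grad, prev_grad, rates)
print(rates)                             # -> [0.11, 0.05, 0.11]
```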