Multilayer Neural Networks



Brain vs. Computer. Computer: designed to solve logic and arithmetic problems; can solve a gazillion arithmetic and logic problems in an hour; absolute precision; usually one very fast processor; high reliability. Brain: evolved (in large part) for pattern recognition; can solve a gazillion PR problems in an hour; a huge number of parallel but relatively slow and unreliable processors; not perfectly precise; not perfectly reliable. Seek inspiration from the human brain for PR?

Neuron: Basic Brain Processor. Neurons are nerve cells that transmit signals to and from the brain at a speed of around 200 mph. Each neuron communicates with anywhere from 1,000 to 10,000 other neurons, muscle cells, glands, and so on. We have around 10^10 neurons in our brain (a network of neurons). Most of the neurons a person is ever going to have are already present at birth.

Neuron: Basic Brain Processor. (figure: neuron with nucleus, cell body, axon and dendrites) Main components of a neuron: the cell body, which holds the DNA information in its nucleus; the dendrites, usually short (a neuron may have thousands of dendrites); and the axon, a long structure which splits into possibly thousands of branches at the end and may be up to a meter long.

Neuron in Action (simplified). (figure: dendrites, neuron body, axon) Input: the neuron collects signals from other neurons through its dendrites (it may have thousands of dendrites). Processor: the signals are accumulated and processed by the cell body. Output: if the strength of the incoming signals is large enough, the cell body sends a signal (a spike of electrical activity) down the axon.

Neural Network

ANN History: Birth. 1943, famous paper by W. McCulloch (neurophysiologist) and W. Pitts (mathematician). Using only math and algorithms, they constructed a model of how a neural network may work and showed that it is possible to construct any computable function with their network. Was it possible to make a model of the thoughts of a human being? Considered to be the birth of AI. 1949, D. Hebb introduced the first (purely psychological) theory of learning: the brain learns at tasks through life, thereby going through tremendous changes; if two neurons fire together, they strengthen each other's responses and are likely to fire together in the future.

ANN History: First Successes. 1958, F. Rosenblatt: the perceptron, the oldest neural network still in use today; an algorithm to train the perceptron network (training is still the most actively researched area today); built in hardware; convergence proved in the linearly separable case. 1959, B. Widrow and M. Hoff: Madaline, the first ANN applied to a real problem (eliminating echoes in phone lines); still in commercial use.

ANN History: Stagnation. Early success led to a lot of claims which were not fulfilled. 1969, M. Minsky and S. Papert, book "Perceptrons": proved that perceptrons can learn only linearly separable classes, and in particular cannot learn the very simple XOR function; conjectured that multilayer neural networks are also limited to linearly separable functions. No funding and almost no research (at least in North America) in the 1970s as the result of the two things above.

ANN History: Revival. Revival of ANN in the 1980s. 1982, J. Hopfield: a new kind of network (Hopfield networks) with bidirectional connections between neurons; implements associative memory. 1982, joint US-Japanese conference on ANN; the US worries that it will stay behind. Many examples of multilayer NN appear. 1982, discovery of the backpropagation algorithm, which allows a network to learn classes that are not linearly separable; discovered independently by 1. Y. LeCun, 2. D. Parker, 3. Rumelhart, Hinton, Williams.

ANN: Perceptron. Input and output layers only; the discriminant function is g(x) = w^t x + w_0. Limitation: it can learn only linearly separable classes.
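As a rough illustration (not part of the original slides), a perceptron decision rule based on g(x) = w^t x + w_0 might look like the following sketch; the weights here are arbitrary placeholders.

```python
import numpy as np

def perceptron_predict(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0; the class is the sign of g(x)."""
    g = np.dot(w, x) + w0
    return 1 if g >= 0 else -1

# Hypothetical weights for a 2D example: the decision line x1 + x2 - 1 = 0.
w, w0 = np.array([1.0, 1.0]), -1.0
print(perceptron_predict(np.array([0.2, 0.3]), w, w0))  # -> -1 (below the line)
print(perceptron_predict(np.array([0.8, 0.9]), w, w0))  # -> +1 (above the line)
```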

MNN: Feed-Forward Operation. (network diagram) Input layer: the d features x^{(1)}, ..., x^{(d)}, plus a bias unit. Hidden layer: units connected to the inputs by weights w_{ji}. Output layer: one output z_k for each class.

MNN: Notation for Weights. Use w_{ji} to denote the weight between input unit i and hidden unit j: input unit x^{(i)} connects to hidden unit y_j through w_{ji}. Use v_{kj} to denote the weight between hidden unit j and output unit k: hidden unit y_j connects to output unit z_k through v_{kj}.

MNN: Notation for Activation. Use net_j to denote the activation at hidden unit j:
net_j = \sum_{i=1}^{d} x^{(i)} w_{ji} + w_{j0}
Use net_k to denote the activation at output unit k:
net_k = \sum_{j=1}^{N_H} y_j v_{kj} + v_{k0}

Discriminant Function. Discriminant function for class k (the output of the k-th output unit):
g_k(x) = z_k = f\Big( \sum_{j=1}^{N_H} v_{kj} \, f\Big( \sum_{i=1}^{d} w_{ji} x^{(i)} + w_{j0} \Big) + v_{k0} \Big)
where the inner sum is the activation at the j-th hidden unit and the outer sum is the activation at the k-th output unit.
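To make the formula concrete, here is a minimal NumPy sketch of the feedforward computation of g_k(x) = z_k using the notation above; the weight matrices and the choice of a sigmoid for f are illustrative assumptions, not part of the slides.

```python
import numpy as np

def f(a):
    """Sigmoid activation, assumed differentiable as required later for backprop."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W, w0, V, v0):
    """Feedforward pass of a one-hidden-layer MNN.
    W: (n_H, d) input-to-hidden weights w_ji, w0: (n_H,) hidden biases w_j0,
    V: (m, n_H) hidden-to-output weights v_kj, v0: (m,) output biases v_k0."""
    net_h = W @ x + w0        # net_j, activation at each hidden unit
    y = f(net_h)              # y_j = f(net_j)
    net_o = V @ y + v0        # net_k, activation at each output unit
    z = f(net_o)              # z_k = g_k(x)
    return z

# Tiny example: d = 3 inputs, n_H = 4 hidden units, m = 2 outputs.
rng = np.random.default_rng(0)
W, w0 = rng.normal(size=(4, 3)), rng.normal(size=4)
V, v0 = rng.normal(size=(2, 4)), rng.normal(size=2)
print(forward(np.array([0.5, -1.0, 2.0]), W, w0, V, v0))
```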

Discriminant Function

Expressive Power of MNN. It can be shown that every continuous function from input to output can be implemented with enough hidden units, one hidden layer, and a proper nonlinear activation function. This is more of theoretical than practical interest: the proof is not constructive (it does not tell us exactly how to construct the MNN), and even if it were constructive it would be of no use, since we do not know the desired function anyway; our goal is to learn it through the samples. But this result does give us confidence that we are on the right track: the MNN is general enough to construct the correct decision boundaries, unlike the perceptron.

MNN Activation Function. Must be nonlinear for expressive power larger than that of the perceptron: if we use a linear activation function at the hidden layer, we can only deal with linearly separable classes. Suppose the activation at hidden unit j is linear, h(u) = a_j u. Then
g_k(x) = f\Big( \sum_{j=1}^{N_H} v_{kj} \, h\Big( \sum_{i=1}^{d} w_{ji} x^{(i)} + w_{j0} \Big) + v_{k0} \Big)
       = f\Big( \sum_{j=1}^{N_H} v_{kj} a_j \Big( \sum_{i=1}^{d} w_{ji} x^{(i)} + w_{j0} \Big) + v_{k0} \Big)
       = f\Big( \sum_{i=1}^{d} \Big( \sum_{j=1}^{N_H} v_{kj} a_j w_{ji} \Big) x^{(i)} + \sum_{j=1}^{N_H} v_{kj} a_j w_{j0} + v_{k0} \Big)
       = f\Big( \sum_{i=1}^{d} w^{new}_{ki} x^{(i)} + w^{new}_{k0} \Big)
with w^{new}_{ki} = \sum_j v_{kj} a_j w_{ji} and w^{new}_{k0} = \sum_j v_{kj} a_j w_{j0} + v_{k0}, which is exactly a perceptron.

MNN Activation Function. We could use a discontinuous activation function, e.g. f(net) = 1 if net ≥ 0 and f(net) = -1 if net < 0. However, we will use gradient descent for learning, so we need a continuous activation function, e.g. the sigmoid function. From now on, assume f is a differentiable function.
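Assuming the standard logistic sigmoid f(net) = 1 / (1 + e^{-net}) (the slide only names the sigmoid), its derivative f'(net) = f(net)(1 - f(net)) is what the gradient-descent updates below will use; a small sketch:

```python
import numpy as np

def sigmoid(net):
    """Continuous, differentiable activation f(net) = 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_deriv(net):
    """f'(net) = f(net) * (1 - f(net)); always positive, as used in the sign argument later."""
    s = sigmoid(net)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_deriv(0.0))  # -> 0.5 0.25
```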

MNN: Modes of Operation. A network has two modes of operation. Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to get the outputs at the output units (no cycles!). Learning: supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output.

MNN. We can vary the number of hidden layers. The activation function must be nonlinear; we can use different functions for the hidden and output layers, and even a different function at each hidden and output node.

MNN: Class Representation. Training samples x_1, ..., x_n, each belonging to one of the classes 1, ..., m. Let the network output z = [z_1, ..., z_c, ..., z_m]^T represent class c as the target vector t^{(c)}, which has 1 in the c-th row and 0 everywhere else:
t^{(c)} = [0, ..., 0, 1, 0, ..., 0]^T
Our ultimate goal for the feedforward operation: MNN training should modify (learn) the MNN parameters w_{ji} and v_{kj} so that for each training sample of class c the MNN output is z ≈ t^{(c)}.
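A small sketch of the target encoding t^{(c)} described above, assuming the classes are numbered 1, ..., m:

```python
import numpy as np

def target_vector(c, m):
    """Return t^(c): an m-dimensional vector with 1 in position c and 0 elsewhere."""
    t = np.zeros(m)
    t[c - 1] = 1.0          # classes are numbered 1..m
    return t

print(target_vector(3, 5))  # -> [0. 0. 1. 0. 0.]
```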

Network Training (Learning).
1. Initialize the weights w_{ji} and v_{kj} randomly, but not to 0.
2. Iterate until a stopping criterion is reached: choose an input sample x_p, compute the output z = [z_1, ..., z_m]^T of the MNN with the current weights w_{ji} and v_{kj}, compare the output z with the desired target t, and adjust w_{ji} and v_{kj} to move closer to the goal t (by backpropagation).

Backpropagation. Learn w_{ji} and v_{kj} by minimizing the training error. What is the training error? Suppose the output of the MNN for sample x is z and the target (desired output for x) is t.
Error on one sample:
J(w, v) = \frac{1}{2} \sum_{c=1}^{m} (t_c - z_c)^2
Training error over all n samples:
J(w, v) = \frac{1}{2} \sum_{i=1}^{n} \sum_{c=1}^{m} ( t_c^{(i)} - z_c^{(i)} )^2
Use gradient descent: start with random (w^{(0)}, v^{(0)}) and repeat until convergence:
w^{(t+1)} = w^{(t)} - \eta \nabla_w J(w^{(t)}, v^{(t)}), \quad v^{(t+1)} = v^{(t)} - \eta \nabla_v J(w^{(t)}, v^{(t)})
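As a generic illustration of the update w ← w − η ∇J(w) (deliberately not the MNN error yet), a minimal gradient-descent loop on a simple quadratic might look like this:

```python
import numpy as np

def gradient_descent(grad_J, w_init, eta=0.1, n_steps=100):
    """Generic update w <- w - eta * grad_J(w), as in the slide."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_J(w)
    return w

# Example: J(w) = ||w - [1, 2]||^2 / 2, so grad_J(w) = w - [1, 2]; minimum at [1, 2].
print(gradient_descent(lambda w: w - np.array([1.0, 2.0]), w_init=[0.0, 0.0]))
```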

Backpropagation. For simplicity, first take the training error for one sample x:
J(w, v) = \frac{1}{2} \sum_{c} (t_c - z_c)^2
where each t_c is a fixed constant and each z_c is a function of w and v:
z_k = f\Big( \sum_{j=1}^{N_H} v_{kj} \, f\Big( \sum_{i=1}^{d} w_{ji} x^{(i)} + w_{j0} \Big) + v_{k0} \Big)
We need to compute: 1. the partial derivatives \partial J / \partial v_{kj} with respect to the hidden-to-output weights; 2. the partial derivatives \partial J / \partial w_{ji} with respect to the input-to-hidden weights.

Backpropagation: Layered Model.
activation at hidden unit j: net_j = \sum_{i=1}^{d} x^{(i)} w_{ji} + w_{j0}
output at hidden unit j: y_j = f(net_j)
activation at output unit k: net_k = \sum_{j=1}^{N_H} y_j v_{kj} + v_{k0}
output at output unit k: z_k = f(net_k)
objective function: J(w, v) = \frac{1}{2} \sum_{c} (t_c - z_c)^2
Both \partial J / \partial v_{kj} and \partial J / \partial w_{ji} are obtained with the chain rule.

Backpropagation. First compute the hidden-to-output derivatives. With J(w, v) = \frac{1}{2} \sum_c (t_c - z_c)^2, z_k = f(net_k), and net_k = \sum_j y_j v_{kj} + v_{k0}, the chain rule gives
\frac{\partial J}{\partial v_{kj}} = \frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial net_k} \frac{\partial net_k}{\partial v_{kj}},
with \frac{\partial J}{\partial z_k} = -(t_k - z_k), \frac{\partial z_k}{\partial net_k} = f'(net_k), and \frac{\partial net_k}{\partial v_{kj}} = y_j (and 1 for the bias, j = 0). Therefore
\frac{\partial J}{\partial v_{kj}} = -(t_k - z_k) f'(net_k) y_j  if j > 0,   \frac{\partial J}{\partial v_{k0}} = -(t_k - z_k) f'(net_k)  if j = 0.

Backpropagation. Gradient descent single-sample update rule for the hidden-to-output weights:
for j > 0: v_{kj}^{(t+1)} = v_{kj}^{(t)} + \eta (t_k - z_k) f'(net_k) y_j
for j = 0 (bias weight): v_{k0}^{(t+1)} = v_{k0}^{(t)} + \eta (t_k - z_k) f'(net_k)

Backpropagation. Now compute the input-to-hidden derivatives \partial J / \partial w_{ji}. The weight w_{ji} affects every output z_k through y_j = f(net_j), so we apply the chain rule and sum over the output units:
\frac{\partial J}{\partial w_{ji}} = \sum_{k=1}^{m} \frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial net_k} \frac{\partial net_k}{\partial y_j} \frac{\partial y_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = -\sum_{k=1}^{m} (t_k - z_k) f'(net_k) v_{kj} \, f'(net_j) \, x^{(i)}
so that
\frac{\partial J}{\partial w_{ji}} = -f'(net_j) x^{(i)} \sum_{k} (t_k - z_k) f'(net_k) v_{kj}  if i > 0,   \frac{\partial J}{\partial w_{j0}} = -f'(net_j) \sum_{k} (t_k - z_k) f'(net_k) v_{kj}  if i = 0.

Backpropagation.
\frac{\partial J}{\partial w_{ji}} = -f'(net_j) x^{(i)} \sum_{k} (t_k - z_k) f'(net_k) v_{kj}  if i > 0,   \frac{\partial J}{\partial w_{j0}} = -f'(net_j) \sum_{k} (t_k - z_k) f'(net_k) v_{kj}  if i = 0.
Gradient descent single-sample update rule for the input-to-hidden weights:
for i > 0: w_{ji}^{(t+1)} = w_{ji}^{(t)} + \eta f'(net_j) x^{(i)} \sum_{k} (t_k - z_k) f'(net_k) v_{kj}
for i = 0 (bias weight): w_{j0}^{(t+1)} = w_{j0}^{(t)} + \eta f'(net_j) \sum_{k} (t_k - z_k) f'(net_k) v_{kj}

Backpropagation of Errors.
\frac{\partial J}{\partial w_{ji}} = -f'(net_j) x^{(i)} \sum_{k} (t_k - z_k) f'(net_k) v_{kj},   \frac{\partial J}{\partial v_{kj}} = -(t_k - z_k) f'(net_k) y_j
The name "backpropagation" comes from the fact that during training the output errors (t_k - z_k) are propagated back from the output layer to the hidden layer.

Backpropagation. Consider the update rule for the hidden-to-output weights:
v_{kj}^{(t+1)} = v_{kj}^{(t)} + \eta (t_k - z_k) f'(net_k) y_j
Suppose t_k - z_k > 0. Then the output z_k of the k-th output unit is too small: t_k > z_k. Typically the activation function f is such that f' > 0, thus (t_k - z_k) f'(net_k) > 0. There are 2 cases: 1. y_j > 0: then to increase z_k we should increase the weight v_{kj}, which is exactly what we do since \eta (t_k - z_k) f'(net_k) y_j > 0. 2. y_j < 0: then to increase z_k we should decrease the weight v_{kj}, which is exactly what we do since \eta (t_k - z_k) f'(net_k) y_j < 0.

Backpropagation. The case t_k - z_k < 0 is analogous. Similarly, one can show that the input-to-hidden weight updates make sense. Important: the weights should be initialized to random nonzero numbers. Since
\frac{\partial J}{\partial w_{ji}} = -f'(net_j) x^{(i)} \sum_{k} (t_k - z_k) f'(net_k) v_{kj},
if all v_{kj} = 0 the input-to-hidden weights w_{ji} are never updated.

Training Protocols. How should the samples in the training set be presented and the weights updated? Three major training protocols: 1. Stochastic: patterns are chosen randomly from the training set, and the network weights are updated after every sample presentation. 2. Batch: the weights are updated based on all samples; the weight update is iterated. 3. Online: each sample is presented only once, with a weight update after each sample presentation.

Stochastic Backpropagation.
1. Initialize the number of hidden units n_H, the weights w and v, the convergence criterion θ and the learning rate η; set t ← 0.
2. do: x ← randomly chosen training pattern;
   for all i, j, k (1 ≤ i ≤ d, 1 ≤ j ≤ n_H, 1 ≤ k ≤ m):
     v_{kj} ← v_{kj} + η (t_k - z_k) f'(net_k) y_j
     v_{k0} ← v_{k0} + η (t_k - z_k) f'(net_k)
     w_{ji} ← w_{ji} + η f'(net_j) x^{(i)} \sum_k (t_k - z_k) f'(net_k) v_{kj}
     w_{j0} ← w_{j0} + η f'(net_j) \sum_k (t_k - z_k) f'(net_k) v_{kj}
   t ← t + 1
   until ||∇J|| < θ
3. return v, w
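The procedure above might be implemented roughly as follows; this is a sketch that assumes one hidden layer, sigmoid activations at both layers, the squared-error criterion J from the earlier slides, and biases folded into column 0 of each weight matrix (all names are illustrative).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stochastic_backprop(X, T, n_H, eta=0.5, theta=1e-3, max_iter=100000, seed=0):
    """Single-sample (stochastic) backpropagation for a one-hidden-layer MNN.
    X: (n, d) training samples, T: (n, m) one-hot targets t^(c).
    Returns V (hidden-to-output) and W (input-to-hidden), bias in column 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = T.shape[1]
    # Random nonzero initialization (bias in column 0).
    W = rng.uniform(-0.5, 0.5, size=(n_H, d + 1))
    V = rng.uniform(-0.5, 0.5, size=(m, n_H + 1))
    for t in range(max_iter):
        p = rng.integers(n)                           # randomly chosen training pattern
        x = np.concatenate(([1.0], X[p]))             # prepend 1 for the bias weight
        # Feedforward.
        net_h = W @ x                                 # net_j
        y = np.concatenate(([1.0], sigmoid(net_h)))   # y_j, with bias unit y_0 = 1
        z = sigmoid(V @ y)                            # z_k = f(net_k)
        # Errors propagated back from output to hidden layer.
        delta_o = (T[p] - z) * z * (1.0 - z)          # (t_k - z_k) f'(net_k)
        delta_h = (V[:, 1:].T @ delta_o) * y[1:] * (1.0 - y[1:])  # f'(net_j) sum_k (...) v_kj
        # Single-sample gradient-descent updates from the previous slides.
        dV = np.outer(delta_o, y)
        dW = np.outer(delta_h, x)
        V += eta * dV
        W += eta * dW
        # Stopping criterion: gradient magnitude (on this sample) below theta.
        if np.sqrt(np.sum(dV ** 2) + np.sum(dW ** 2)) < theta:
            break
    return V, W
```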

Batch Backpropagation. This is true gradient descent (unlike stochastic backpropagation). For simplicity, we derived backpropagation for the single-sample objective function
J(w, v) = \frac{1}{2} \sum_{c} (t_c - z_c)^2
The full objective function is
J(w, v) = \frac{1}{2} \sum_{p=1}^{n} \sum_{c} ( t_c^{(p)} - z_c^{(p)} )^2
The derivative of the full objective function is just the sum over samples of the per-sample derivatives, which we have already derived.

Batch Backpropagation. For example,
\frac{\partial J}{\partial w_{ji}} = -\sum_{p=1}^{n} f'(net_j) x_p^{(i)} \sum_{k} (t_k - z_k) f'(net_k) v_{kj}
where net_j, net_k, z_k and t_k are computed for sample x_p.

Batch Backpropagation.
1. Initialize n_H, w, v, θ, η; set t ← 0.
2. do (one epoch): t ← t + 1; set Δv_{kj} ← 0, Δv_{k0} ← 0, Δw_{ji} ← 0, Δw_{j0} ← 0
   for all 1 ≤ p ≤ n:
     for all i, j, k (1 ≤ i ≤ d, 1 ≤ j ≤ n_H, 1 ≤ k ≤ m):
       Δv_{kj} ← Δv_{kj} + η (t_k - z_k) f'(net_k) y_j
       Δv_{k0} ← Δv_{k0} + η (t_k - z_k) f'(net_k)
       Δw_{ji} ← Δw_{ji} + η f'(net_j) x_p^{(i)} \sum_k (t_k - z_k) f'(net_k) v_{kj}
       Δw_{j0} ← Δw_{j0} + η f'(net_j) \sum_k (t_k - z_k) f'(net_k) v_{kj}
   v_{kj} ← v_{kj} + Δv_{kj}; v_{k0} ← v_{k0} + Δv_{k0}; w_{ji} ← w_{ji} + Δw_{ji}; w_{j0} ← w_{j0} + Δw_{j0}
   until ||∇J|| < θ
3. return v, w
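A batch variant, again as an illustrative sketch under the same assumptions as before, accumulates the same per-sample increments over one epoch and applies them in a single update:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def batch_backprop_epoch(X, T, W, V, eta=0.5):
    """One epoch of batch backpropagation: accumulate Delta_w, Delta_v over all
    samples, then apply them once. W: (n_H, d+1), V: (m, n_H+1), bias in column 0."""
    dW = np.zeros_like(W)
    dV = np.zeros_like(V)
    for p in range(X.shape[0]):
        x = np.concatenate(([1.0], X[p]))
        y = np.concatenate(([1.0], sigmoid(W @ x)))
        z = sigmoid(V @ y)
        delta_o = (T[p] - z) * z * (1.0 - z)
        delta_h = (V[:, 1:].T @ delta_o) * y[1:] * (1.0 - y[1:])
        dV += eta * np.outer(delta_o, y)   # accumulate, do not update yet
        dW += eta * np.outer(delta_h, x)
    return W + dW, V + dV                  # single update at the end of the epoch
```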

Training Protocols. 1. Batch: true gradient descent. 2. Stochastic: faster than the batch method; usually the recommended way. 3. Online: used when the number of samples is so large that the data does not fit in memory; dependent on the order of sample presentation; should be avoided when possible.

MNN Training. As training time increases: large training error in the beginning (random decision regions); small training error as the decision regions improve with time; zero training error when the decision regions separate the training data perfectly, but then we have overfitted the network.

MNN Learning Curves. Training data: the data on which learning (gradient descent for the MNN) is performed. Validation data: used to assess the network's generalization capabilities. (figure: classification error versus training time, with a training error curve and a validation error curve) The training error typically goes down, since with enough hidden units we can find a discriminant function that classifies the training patterns exactly. The validation error first goes down, but then goes up: at some point we start to overfit the network to the training data.

Learning Curves. (figure: classification error versus training time, with validation and training error curves) The point where the validation error starts to rise is a good time to stop training, since after this time we start to overfit. The stopping criterion is part of the training phase, thus the validation data is effectively part of the training data. To assess how the network will work on unseen examples, we still need test data.

Learning Curves. Validation data is used to determine parameters, in this case when learning should stop. Stop training after the first local minimum of the error on the validation data. We are assuming that performance on the test data will be similar to performance on the validation data.
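Early stopping as described here could be sketched as follows; train_one_epoch and validation_error are hypothetical callbacks standing in for one epoch of backpropagation and for evaluating the classification error on the validation set.

```python
def train_with_early_stopping(train_one_epoch, validation_error, max_epochs=1000):
    """Stop training at the first local minimum of the validation error.
    train_one_epoch() and validation_error() are hypothetical callbacks:
    the first performs one epoch of gradient descent on the training data,
    the second returns the current classification error on the validation data."""
    best_err = float("inf")
    for epoch in range(max_epochs):
        train_one_epoch()
        err = validation_error()
        if err > best_err:   # validation error started to rise: first local minimum passed
            break            # stop training here
        best_err = err
    return best_err
```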

Data Sets. Training data: the data on which learning is performed. Validation data: used to determine any free parameters of the classifier (k in the k-nearest-neighbor classifier, h for Parzen windows, the number of hidden layers in the MNN, etc.). Test data: used to assess the network's generalization capabilities.

MNN as Nonlinear Mapping. (network diagram) The input-to-hidden module implements a nonlinear mapping φ of the inputs x^{(1)}, ..., x^{(d)}; the hidden-to-output module implements a linear classifier (perceptron) on top of it, producing the outputs z_1, ..., z_m.

MNN as Nonlinear Mapping. Thus the MNN can be thought of as learning two things at the same time: the nonlinear mapping of the inputs, and a linear classifier of the nonlinearly mapped inputs.

MNN as Nonlinear Mapping. In the original feature space the patterns x are not linearly separable. The MNN finds a nonlinear mapping y = φ(x) to 2 dimensions (2 hidden units) in which the patterns are almost linearly separable, and a nonlinear mapping y = φ(x) to 3 dimensions (3 hidden units) in which the patterns are linearly separable.
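As a concrete, hand-constructed illustration (not from the slides): XOR is not linearly separable in the original 2D space, but after a mapping y = φ(x) by two hidden threshold units it becomes linearly separable.

```python
import numpy as np

# Hand-picked hidden units (threshold activations) mapping x -> y = phi(x):
#   y1 = [x1 + x2 >= 0.5],  y2 = [x1 + x2 >= 1.5]
def phi(x):
    return np.array([float(x[0] + x[1] >= 0.5), float(x[0] + x[1] >= 1.5)])

# In the y-space, XOR(x) = 1 exactly when y1 - y2 - 0.5 > 0: a linear classifier.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = phi(np.array(x))
    z = 1 if y[0] - y[1] - 0.5 > 0 else 0
    print(x, "-> hidden", y, "-> output", z)   # reproduces XOR: 0, 1, 1, 0
```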

Concluding Remarks. Advantages: an MNN can learn complex mappings from inputs to outputs based only on the training samples; it is easy to use; it is easy to incorporate a lot of heuristics. Disadvantages: it is a black box that is difficult to analyze and whose behavior is hard to predict; it may take a long time to train; it may get trapped in a bad local minimum; there are a lot of tricks to implement for the best performance.