Multilayer Neural Networks
Brain vs. Computer
Computer: designed to solve logic and arithmetic problems; can solve a gazillion arithmetic and logic problems in an hour, with absolute precision; usually one very fast processor; high reliability.
Brain: evolved (in large part) for pattern recognition; can solve a gazillion PR problems in an hour; huge number of parallel but relatively slow and unreliable processors; not perfectly precise, not perfectly reliable.
Seek an inspiration from the human brain for PR?
Neuron: Basic Brain Processor
Neurons are nerve cells that transmit signals to and from the brain at a speed of around 200 mph.
Each neuron communicates with anywhere from 1,000 to 10,000 other neurons, muscle cells, glands, and so on.
We have around 10^10 neurons in our brain (a network of neurons).
Most neurons a person is ever going to have are already present at birth.
Neuron: Basic Brain Processor
Main components of a neuron (see figure: nucleus, cell body, axon, dendrites):
Cell body, which holds the DNA information in its nucleus.
Dendrites: a neuron may have thousands of dendrites, usually short.
Axon: a long structure, which splits into possibly thousands of branches at the end; may be up to 1 meter long.
Neuron in Action (simplified)
Input: the neuron collects signals from other neurons through its dendrites (it may have thousands of dendrites).
Processor: signals are accumulated and processed by the cell body.
Output: if the strength of the incoming signals is large enough, the cell body sends a signal (a spike of electrical activity) down the axon.
Neural Network
ANN History: Birth
1943: famous paper by W. McCulloch (neurophysiologist) and W. Pitts (mathematician). Using only math and algorithms, they constructed a model of how a neural network may work, and showed it is possible to construct any computable function with their network. Was it possible to make a model of the thoughts of a human being? Considered to be the birth of AI.
1949: D. Hebb introduced the first (purely psychological) theory of learning. The brain learns at tasks through life, thereby going through tremendous changes. If two neurons fire together, they strengthen each other's responses and are likely to fire together in the future.
ANN History: First Successes
1958: F. Rosenblatt's perceptron, the oldest neural network still in use today. An algorithm to train the perceptron network (training is still the most actively researched area today). Built in hardware. Convergence proved in the linearly separable case.
1959: B. Widrow and M. Hoff's Madaline, the first ANN applied to a real problem (eliminating echoes in phone lines). Still in commercial use.
ANN History: Stagnation
Early success led to a lot of claims which were not fulfilled.
1969: M. Minsky and S. Papert's book "Perceptrons". Proved that perceptrons can learn only linearly separable classes; in particular, they cannot learn the very simple XOR function. Conjectured that multilayer neural networks are also limited to linearly separable functions.
No funding and almost no research (at least in North America) in the 1970s as the result of the two things above.
ANN History: Revival
Revival of ANN in the 1980s.
1982: J. Hopfield's new kind of network (Hopfield networks), with bidirectional connections between neurons; implements associative memory.
1982: joint US-Japanese conference on ANN; the US worries that it will stay behind. Many examples of multilayer NN appear.
1982: discovery of the backpropagation algorithm, which allows a network to learn classes that are not linearly separable. Discovered independently by 1. Y. LeCun, 2. D. Parker, 3. Rumelhart, Hinton, and Williams.
ANN: Perceptron
Input and output layers: g(x) = w^t x + w_0
Limitation: can learn only linearly separable classes.
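As a minimal sketch of the perceptron discriminant g(x) = w^t x + w_0 (the weight values below are illustrative, not from the text):

```python
# Minimal perceptron: classify by the sign of g(x) = w.x + w0.

def perceptron(x, w, w0):
    """Return +1 if w.x + w0 >= 0, else -1."""
    g = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if g >= 0 else -1

# With w = [1, 1], w0 = -1.5 the perceptron implements logical AND on
# inputs in {0,1}^2 -- a linearly separable function:
print(perceptron([1, 1], [1, 1], -1.5))  # -> 1
print(perceptron([1, 0], [1, 1], -1.5))  # -> -1
```

No single line g(x) = w^t x + w_0 = 0 can separate the XOR labelling of the same four points, which is exactly the limitation noted above.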
MNN: Feed-Forward Operation
Input layer: d features x^(1), x^(2), ..., x^(d), plus a bias unit.
Hidden layer: hidden units, each computing an output y_j from the inputs through weights w_ji.
Output layer: m outputs z_1, ..., z_m, one for each class.
MNN: Notation for Weights
Use w_ji to denote the weight between input unit i and hidden unit j: input unit i holds x^(i), and the edge with weight w_ji feeds hidden unit j, which outputs y_j.
Use v_kj to denote the weight between hidden unit j and output unit k: hidden unit j holds y_j, and the edge with weight v_kj feeds output unit k, which outputs z_k.
MNN: Notation for Activation
Use net_j to denote the activation at hidden unit j:
net_j = Σ_{i=1..d} x^(i) w_ji + w_j0
(for example, inputs x^(1), x^(2) with weights w_j1, w_j2 and bias w_j0 feed hidden unit y_j).
Use net_k to denote the activation at output unit k:
net_k = Σ_{j=1..N_H} y_j v_kj + v_k0
(the hidden outputs y_1, ..., y_{N_H} with weights v_kj and bias v_k0 feed output unit z_k).
Discriminant Function
Discriminant function for class k (the output of the kth output unit):
g_k(x) = z_k = f( Σ_{j=1..N_H} v_kj f( Σ_{i=1..d} w_ji x^(i) + w_j0 ) + v_k0 )
where the inner sum is the activation at the jth hidden unit and the outer sum is the activation at the kth output unit.
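The discriminant function is just two weighted sums with a nonlinearity after each. A minimal sketch of this feed-forward computation, with made-up weight values and d = 2, N_H = 2, m = 2:

```python
import math

def sigmoid(u):
    """A common choice of differentiable activation f."""
    return 1.0 / (1.0 + math.exp(-u))

def mnn_forward(x, W, w0, V, v0, f=sigmoid):
    """One hidden layer: y_j = f(sum_i w_ji x^(i) + w_j0),
    z_k = f(sum_j v_kj y_j + v_k0)."""
    y = [f(sum(wji * xi for wji, xi in zip(Wj, x)) + w0j)
         for Wj, w0j in zip(W, w0)]
    z = [f(sum(vkj * yj for vkj, yj in zip(Vk, y)) + v0k)
         for Vk, v0k in zip(V, v0)]
    return y, z

# Illustrative weights (not from the text): d=2, N_H=2, m=2.
W  = [[1.0, -1.0], [0.5, 0.5]]   # w_ji
w0 = [0.0, 0.1]                  # w_j0 (hidden biases)
V  = [[1.0, -0.5], [0.3, 0.8]]   # v_kj
v0 = [0.0, -0.2]                 # v_k0 (output biases)
y, z = mnn_forward([1.0, 2.0], W, w0, V, v0)
# Sigmoid outputs always lie strictly between 0 and 1:
assert all(0.0 < zk < 1.0 for zk in z)
```

The pattern would be assigned to the class k with the largest g_k(x) = z_k.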
Expressive Power of MNN
It can be shown that every continuous function from input to output can be implemented with enough hidden units, 1 hidden layer, and proper nonlinear activation functions.
This is more of theoretical than practical interest: the proof is not constructive (it does not tell us exactly how to construct the MNN). Even if it were constructive, it would be of no use, since we do not know the desired function anyway; our goal is to learn it through the samples.
But this result does give us confidence that we are on the right track: the MNN is general enough to construct the correct decision boundaries, unlike the perceptron.
MNN: Activation Function
Must be nonlinear for expressive power larger than that of the perceptron: if we use a linear activation function at the hidden layer, we can only deal with linearly separable classes.
Suppose at hidden unit j the activation is linear, h(u) = a_j u. Then
g_k(x) = f( Σ_{j=1..N_H} v_kj h( Σ_{i=1..d} w_ji x^(i) + w_j0 ) + v_k0 )
       = f( Σ_{j=1..N_H} v_kj a_j ( Σ_{i=1..d} w_ji x^(i) + w_j0 ) + v_k0 )
       = f( Σ_{i=1..d} ( Σ_{j=1..N_H} v_kj a_j w_ji ) x^(i) + Σ_{j=1..N_H} v_kj a_j w_j0 + v_k0 )
       = f( Σ_{i=1..d} w_i^new x^(i) + w_0^new )
with w_i^new = Σ_{j=1..N_H} v_kj a_j w_ji and w_0^new = Σ_{j=1..N_H} v_kj a_j w_j0 + v_k0, i.e., the network collapses to a perceptron.
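This collapse can be checked numerically. A small sketch with made-up values for a_j, w_ji, w_j0, v_kj, v_k0 (one output unit, two inputs, two hidden units):

```python
# Numerical check: with a *linear* hidden activation h(u) = a_j * u, the
# two-layer activation net_k equals a single linear function of x.
# All numbers below are arbitrary illustrative values.

a  = [2.0, -0.5]                  # hidden slopes a_j
W  = [[1.0, -2.0], [0.5, 3.0]]    # w_ji
w0 = [0.3, -0.7]                  # w_j0
V  = [[1.5, -1.0]]                # v_kj (one output unit)
v0 = [0.2]                        # v_k0

def net_k_two_layer(x):
    y = [a[j] * (sum(W[j][i] * x[i] for i in range(2)) + w0[j])
         for j in range(2)]
    return sum(V[0][j] * y[j] for j in range(2)) + v0[0]

# Equivalent single-layer weights from the derivation:
w_new  = [sum(V[0][j] * a[j] * W[j][i] for j in range(2)) for i in range(2)]
w0_new = sum(V[0][j] * a[j] * w0[j] for j in range(2)) + v0[0]

def net_k_one_layer(x):
    return sum(w_new[i] * x[i] for i in range(2)) + w0_new

for x in ([0.0, 0.0], [1.0, -1.0], [2.5, 3.0]):
    assert abs(net_k_two_layer(x) - net_k_one_layer(x)) < 1e-9
```

Since the two computations agree on every input, the linear-hidden-unit network has no more expressive power than a perceptron.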
MNN: Activation Function
We could use a discontinuous activation function:
f(net) = 1 if net ≥ 0, -1 if net < 0
However, we will use gradient descent for learning, so we need a continuous activation function, such as the sigmoid function.
From now on, assume f is a differentiable function.
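A minimal sketch of the sigmoid and its derivative (which gradient descent needs); the identity f'(net) = f(net)(1 - f(net)) is a standard property of this function:

```python
import math

def sigmoid(net):
    """Smooth, differentiable squashing function with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_prime(net):
    """f'(net) = f(net) * (1 - f(net)); positive everywhere."""
    s = sigmoid(net)
    return s * (1.0 - s)

print(sigmoid(0.0))        # -> 0.5
print(sigmoid_prime(0.0))  # -> 0.25
```

Unlike the threshold function above, the sigmoid is differentiable everywhere, so the chain-rule derivatives used in backpropagation are well defined.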
MNN: Modes of Operation
The network has two modes of operation:
Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to get outputs from the output units (no cycles!).
Learning: supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output.
MNN
Can vary the number of hidden layers.
Nonlinear activation function: can use a different function for hidden and output layers, and can use a different function at each hidden and output node.
MNN: Class Representation
Training samples x_1, ..., x_n, each of class 1, ..., m.
Let the network output z represent class c as the target t^(c): the output z = [z_1, ..., z_c, ..., z_m]^t should match t^(c) = [0, ..., 1, ..., 0]^t, with the 1 in the cth row.
Our ultimate goal for the feedforward operation: train the MNN, i.e., modify (learn) the MNN parameters w_ji and v_kj, so that for each training sample of class c the MNN output is z = t^(c).
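A minimal sketch of building the target vector t^(c) for a class label (the function name `one_hot` is my own; classes are numbered 1..m as in the text):

```python
def one_hot(c, m):
    """Target t^(c) for class c out of m classes: 1 in the cth
    position (1-based), 0 everywhere else."""
    return [1 if k == c else 0 for k in range(1, m + 1)]

print(one_hot(2, 4))  # -> [0, 1, 0, 0]
```

Training then tries to drive the network output z toward this vector for every sample of class c.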
Network Training (Learning)
1. Initialize the weights w_ji and v_kj randomly, but not to 0.
2. Iterate until a stopping criterion is reached: choose an input sample x_p; feed it through the MNN with weights w_ji and v_kj to get the output z = [z_1, ..., z_m]^t; compare the output z with the desired target t; adjust w_ji and v_kj to move closer to the goal t (by backpropagation).
Backpropagation
Learn w_ji and v_kj by minimizing the training error. What is the training error?
Suppose the output of the MNN for sample x is z, and the target (desired output for x) is t.
Error on one sample: J(w, v) = (1/2) Σ_{c=1..m} (t_c - z_c)^2
Training error: J(w, v) = (1/2) Σ_{i=1..n} Σ_{c=1..m} (t_c^(i) - z_c^(i))^2
Use gradient descent: initialize w^(0), v^(0) randomly; repeat until convergence:
w^(t+1) = w^(t) - η ∇_w J(w^(t))
v^(t+1) = v^(t) - η ∇_v J(v^(t))
Backpropagation
For simplicity, first take the training error for one sample x:
J(w, v) = (1/2) Σ_{c=1..m} (t_c - z_c)^2
where t is a fixed constant and z is a function of w, v:
z_k = f( Σ_{j=1..N_H} v_kj f( Σ_{i=1..d} w_ji x^(i) + w_j0 ) + v_k0 )
Need to compute:
1. the partial derivatives w.r.t. the hidden-to-output weights: ∂J/∂v_kj
2. the partial derivatives w.r.t. the input-to-hidden weights: ∂J/∂w_ji
Backpropagation: Layered Model
activation at hidden unit j: net_j = Σ_{i=1..d} x^(i) w_ji + w_j0
output at hidden unit j: y_j = f(net_j)
activation at output unit k: net_k = Σ_{j=1..N_H} y_j v_kj + v_k0
output at output unit k: z_k = f(net_k)
objective function: J(w, v) = (1/2) Σ_{c=1..m} (t_c - z_c)^2
Applying the chain rule through these layers gives ∂J/∂v_kj and ∂J/∂w_ji.
Backpropagation
First compute the hidden-to-output derivatives. With J(w, v) = (1/2) Σ_c (t_c - z_c)^2, z_k = f(net_k), and net_k = Σ_{j=1..N_H} y_j v_kj + v_k0:
∂J/∂z_k = ∂/∂z_k [ (1/2) Σ_c (t_c - z_c)^2 ] = -(t_k - z_k)
∂z_k/∂net_k = f'(net_k)
∂net_k/∂v_kj = y_j if j > 0, 1 if j = 0
Therefore:
∂J/∂v_kj = -(t_k - z_k) f'(net_k) y_j if j > 0
∂J/∂v_k0 = -(t_k - z_k) f'(net_k) if j = 0
Backpropagation
Gradient descent single-sample update rule for the hidden-to-output weights:
j > 0: v_kj^(t+1) = v_kj^(t) + η (t_k - z_k) f'(net_k) y_j
j = 0 (bias weight): v_k0^(t+1) = v_k0^(t) + η (t_k - z_k) f'(net_k)
Backpropagation
Now compute the input-to-hidden derivatives. Every output z_k depends on w_ji, so by the chain rule:
∂J/∂w_ji = -Σ_k (t_k - z_k) ∂z_k/∂w_ji
         = -Σ_k (t_k - z_k) (∂z_k/∂net_k) (∂net_k/∂y_j) (∂y_j/∂net_j) (∂net_j/∂w_ji)
with z_k = f(net_k), net_k = Σ_{s=1..N_H} y_s v_ks + v_k0, y_j = f(net_j), net_j = Σ_{h=1..d} w_jh x^(h) + w_j0:
∂z_k/∂net_k = f'(net_k), ∂net_k/∂y_j = v_kj, ∂y_j/∂net_j = f'(net_j), ∂net_j/∂w_ji = x^(i) if i > 0, 1 if i = 0
Therefore:
∂J/∂w_ji = -f'(net_j) x^(i) Σ_k (t_k - z_k) f'(net_k) v_kj
Backpropagation
∂J/∂w_ji = -f'(net_j) x^(i) Σ_k (t_k - z_k) f'(net_k) v_kj if i > 0
∂J/∂w_j0 = -f'(net_j) Σ_k (t_k - z_k) f'(net_k) v_kj if i = 0
Gradient descent single-sample update rule for the input-to-hidden weights:
i > 0: w_ji^(t+1) = w_ji^(t) + η f'(net_j) x^(i) Σ_k (t_k - z_k) f'(net_k) v_kj
i = 0 (bias weight): w_j0^(t+1) = w_j0^(t) + η f'(net_j) Σ_k (t_k - z_k) f'(net_k) v_kj
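The derived formula for the input-to-hidden derivative can be verified against a numerical derivative on a tiny network. All weight and input values below are illustrative; the check compares the analytic ∂J/∂w_ji with a central-difference estimate:

```python
import math

def f(u):                         # sigmoid activation
    return 1.0 / (1.0 + math.exp(-u))
def fp(u):                        # its derivative f'(u)
    s = f(u)
    return s * (1.0 - s)

# Tiny net with made-up weights: d=2 inputs, 2 hidden units, 1 output.
W  = [[0.4, -0.6], [0.1, 0.8]]    # w_ji
w0 = [0.05, -0.2]                 # w_j0
V  = [[0.7, -0.3]]                # v_kj
v0 = [0.1]                        # v_k0
x, t = [0.9, -0.4], [1.0]

def loss(W):
    """J = 1/2 sum_k (t_k - z_k)^2 for the current weights."""
    netj = [sum(W[j][i] * x[i] for i in range(2)) + w0[j] for j in range(2)]
    y = [f(n) for n in netj]
    netk = [sum(V[k][j] * y[j] for j in range(2)) + v0[k] for k in range(1)]
    z = [f(n) for n in netk]
    return 0.5 * sum((t[k] - z[k]) ** 2 for k in range(1))

# Analytic gradient: dJ/dw_ji = -f'(net_j) x^(i) sum_k (t_k - z_k) f'(net_k) v_kj
netj = [sum(W[j][i] * x[i] for i in range(2)) + w0[j] for j in range(2)]
y = [f(n) for n in netj]
netk = [sum(V[k][j] * y[j] for j in range(2)) + v0[k] for k in range(1)]
z = [f(n) for n in netk]
j, i = 0, 1
analytic = -fp(netj[j]) * x[i] * sum(
    (t[k] - z[k]) * fp(netk[k]) * V[k][j] for k in range(1))

# Numerical derivative by central differences on the same weight:
eps = 1e-6
W[j][i] += eps;      up = loss(W)
W[j][i] -= 2 * eps;  down = loss(W)
W[j][i] += eps                     # restore
numeric = (up - down) / (2 * eps)

assert abs(analytic - numeric) < 1e-8
```

Such a gradient check is a common sanity test when implementing backpropagation by hand.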
Backpropagation of Errors
∂J/∂w_ji = -f'(net_j) x^(i) Σ_k (t_k - z_k) f'(net_k) v_kj
∂J/∂v_kj = -(t_k - z_k) f'(net_k) y_j
The terms (t_k - z_k) are the errors at the output units. The name "backpropagation" comes from the fact that during training, these errors are propagated back from the output layer to the hidden layer.
Backpropagation
Consider the update rule for the hidden-to-output weights:
v_kj^(t+1) = v_kj^(t) + η (t_k - z_k) f'(net_k) y_j
Suppose t_k - z_k > 0. Then the output of the kth output unit is too small: t_k > z_k.
Typically the activation function f is such that f' > 0, thus (t_k - z_k) f'(net_k) > 0.
There are 2 cases:
1. If y_j > 0, then to increase z_k we should increase the weight v_kj, which is exactly what we do, since η (t_k - z_k) f'(net_k) y_j > 0.
2. If y_j < 0, then to increase z_k we should decrease the weight v_kj, which is exactly what we do, since η (t_k - z_k) f'(net_k) y_j < 0.
Backpropagation
The case t_k - z_k < 0 is analogous. Similarly, one can show that the input-to-hidden weight updates make sense.
Important: weights should be initialized to random nonzero numbers. Since
∂J/∂w_ji = -f'(net_j) x^(i) Σ_k (t_k - z_k) f'(net_k) v_kj
if all v_kj = 0, the input-to-hidden weights w_ji are never updated.
Training Protocols
How do we present the samples in the training set and update the weights? Three major training protocols:
1. Stochastic: patterns are chosen randomly from the training set, and network weights are updated after every sample presentation.
2. Batch: weights are updated based on all samples; iterate the weight update.
3. Online: each sample is presented only once; weights are updated after each sample presentation.
Stochastic Backpropagation
1. Initialize the number of hidden units n_H, the weights w, v, the convergence criterion θ, and the learning rate η; set time t = 0.
2. do
     x ← a randomly chosen training pattern
     for all 0 ≤ i ≤ d, 0 ≤ j ≤ n_H, 1 ≤ k ≤ m:
       v_kj ← v_kj + η (t_k - z_k) f'(net_k) y_j
       v_k0 ← v_k0 + η (t_k - z_k) f'(net_k)
       w_ji ← w_ji + η f'(net_j) x^(i) Σ_k (t_k - z_k) f'(net_k) v_kj
       w_j0 ← w_j0 + η f'(net_j) Σ_k (t_k - z_k) f'(net_k) v_kj
     t ← t + 1
   until ||∇J|| < θ
3. return v, w
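A runnable sketch of stochastic backpropagation on the XOR problem, which a single perceptron cannot learn. The network sizes, learning rate, iteration count, and seed are all illustrative choices, and the loop stops after a fixed number of updates rather than testing ||∇J|| < θ:

```python
import math, random

def f(u):                          # sigmoid activation
    return 1.0 / (1.0 + math.exp(-u))
def fp_from_y(y):                  # f'(net) = y(1-y) when y = f(net)
    return y * (1.0 - y)

random.seed(0)
d, nH, m, eta = 2, 3, 1, 0.5       # illustrative sizes and learning rate
W  = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(nH)]
w0 = [random.uniform(-0.5, 0.5) for _ in range(nH)]
V  = [[random.uniform(-0.5, 0.5) for _ in range(nH)] for _ in range(m)]
v0 = [random.uniform(-0.5, 0.5) for _ in range(m)]
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]  # XOR

def forward(x):
    y = [f(sum(W[j][i] * x[i] for i in range(d)) + w0[j]) for j in range(nH)]
    z = [f(sum(V[k][j] * y[j] for j in range(nH)) + v0[k]) for k in range(m)]
    return y, z

def total_error():
    return sum(0.5 * sum((t[k] - forward(x)[1][k]) ** 2 for k in range(m))
               for x, t in data)

err_before = total_error()
for _ in range(20000):
    x, t = random.choice(data)                 # randomly chosen pattern
    y, z = forward(x)
    # output errors and back-propagated hidden errors:
    dk = [(t[k] - z[k]) * fp_from_y(z[k]) for k in range(m)]
    dj = [fp_from_y(y[j]) * sum(dk[k] * V[k][j] for k in range(m))
          for j in range(nH)]
    for k in range(m):                         # hidden-to-output updates
        for j in range(nH):
            V[k][j] += eta * dk[k] * y[j]
        v0[k] += eta * dk[k]
    for j in range(nH):                        # input-to-hidden updates
        for i in range(d):
            W[j][i] += eta * dj[j] * x[i]
        w0[j] += eta * dj[j]

assert total_error() < err_before              # training error decreased
```

Note the weights are updated immediately after each randomly chosen sample, which is what distinguishes the stochastic protocol from the batch one.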
Batch Backpropagation
This is the true gradient descent (unlike stochastic backpropagation).
For simplicity, we derived backpropagation for the single-sample objective function:
J(w, v) = (1/2) Σ_c (t_c - z_c)^2
The full objective function:
J(w, v) = (1/2) Σ_{i=1..n} Σ_c (t_c^(i) - z_c^(i))^2
The derivative of the full objective function is just a sum of the derivatives for each sample:
∂J/∂v_kj (w, v) = Σ_{i=1..n} ∂/∂v_kj [ (1/2) Σ_c (t_c^(i) - z_c^(i))^2 ]
and we have already derived the per-sample derivatives.
Batch Backpropagation
For example,
∂J/∂w_ji = Σ_{p=1..n} [ -f'(net_j) x_p^(i) Σ_k (t_k - z_k) f'(net_k) v_kj ]
where net_j, y_j, net_k, z_k are computed on sample x_p.
Batch Bac Propagation. Initialize n H, w,, θ, η, t 0 2. do one epoch t t + until J for all p n for all 0 <θ + 3. return, w 0 w i w 0 w w + η + η ( t z) f' ( net) y ( t z) f ( net ) ( i) f ( net ) x p( t z) f ( net ) f ( net ) ( t z ) f ( net ) 0 0 ' i 0 i d, 0 nh, 0 w + η i w + η 0 0 ; 0 0 + 0; w i w i + w i; w 0 w 0 + w 0
Training Protocols. Batch True gradient descent 2. Stochastic Faster than batch ethod Usually the recoended way 3. Online Used when nuber of saples is so large it does not fit in the eory Dependent on the order of saple presentation Should be aoided when possible
MNN Training
As training time increases:
Large training error: in the beginning, the decision regions are random.
Small training error: the decision regions improve with time.
Zero training error: the decision regions separate the training data perfectly, but we have overfitted the network.
MNN Learning Curves
Training data: data on which learning (gradient descent for the MNN) is performed.
Validation data: used to assess network generalization capabilities.
The training error typically goes down with training time, since with enough hidden units we can find a discriminant function which classifies the training patterns exactly.
The validation error first goes down, but then goes up, since at some point we start to overfit the network to the training data.
Learning Curves
The time when the validation error starts to go up is a good time to stop training, since after this time we start to overfit.
The stopping criterion is part of the training phase; thus the validation data is effectively part of the training data.
To assess how the network will work on unseen examples, we still need test data.
Learning Curves
Validation data is used to determine parameters, in this case when learning should stop.
Stop training after the first local minimum of the error on the validation data.
We are assuming performance on test data will be similar to performance on validation data.
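The "stop at the first local minimum of the validation error" rule can be sketched as follows; the error values are made-up numbers standing in for errors measured after each training epoch:

```python
# Early stopping: pick the first epoch whose validation error is lower
# than both neighbours (a first local minimum), or the last epoch if
# the error never turns back up.

def first_local_min(val_errors):
    for e in range(1, len(val_errors) - 1):
        if val_errors[e] <= val_errors[e - 1] and val_errors[e] < val_errors[e + 1]:
            return e
    return len(val_errors) - 1

# Illustrative curve: goes down, then up again as overfitting starts.
val = [0.9, 0.6, 0.4, 0.35, 0.38, 0.45, 0.5]
print(first_local_min(val))  # -> 3
```

In practice one would save the weights at that epoch and report performance on a separate test set, since the validation data has now been used for model selection.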
Data Sets
Training data: data on which learning is performed.
Validation data: used to determine any free parameters of the classifier, e.g., k in the k-nearest-neighbor classifier, h for Parzen windows, the number of hidden layers in the MNN, etc.
Test data: used to assess network generalization capabilities.
MNN as Nonlinear Mapping
The first module (from inputs x^(1), x^(2), ..., x^(d) to the hidden layer) implements a nonlinear input mapping φ.
The second module (from the hidden layer to the outputs z_1, ..., z_m) implements a linear classifier (Perceptron).
MNN as Nonlinear Mapping
Thus an MNN can be thought of as learning two things at the same time: the nonlinear mapping of the inputs, and a linear classifier of the nonlinearly mapped inputs.
MNN as Nonlinear Mapping
In the original feature space x, the patterns are not linearly separable.
The MNN finds a nonlinear mapping y = φ(x) to 2 dimensions (2 hidden units); the patterns become almost linearly separable.
The MNN finds a nonlinear mapping y = φ(x) to 3 dimensions (3 hidden units); the patterns become linearly separable.
Concluding Remarks
Advantages: the MNN can learn complex mappings from inputs to outputs based only on the training samples; easy to use; easy to incorporate a lot of heuristics.
Disadvantages: it is a black box, that is, it is difficult to analyze and predict its behavior; it may take a long time to train; it may get trapped in a bad local minimum; a lot of tricks are needed for the best performance.