Last updated: Nov 26, 2012

MULTILAYER PERCEPTRONS
Outline
Combining Linear Classifiers
Learning Parameters
Implementing Logical Relations
The AND and OR operations are linearly separable problems.
The XOR Problem
XOR is not linearly separable.

x_1  x_2  XOR  Class
 0    0    0     B
 0    1    1     A
 1    0    1     A
 1    1    0     B

How can we use linear classifiers to solve this problem?
Combining two linear classifiers
Idea: use a logical combination of two linear classifiers:

g_1(x) = x_1 + x_2 - 1/2
g_2(x) = x_1 + x_2 - 3/2
Combining two linear classifiers
Let f(x) be the unit step activation function:

f(x) = 0,  x < 0
f(x) = 1,  x >= 0

Observe that the classification problem is then solved by

f( y_1 - y_2 - 1/2 )

where y_1 = f(g_1(x)) and y_2 = f(g_2(x)), with
g_1(x) = x_1 + x_2 - 1/2 and g_2(x) = x_1 + x_2 - 3/2.
Combining two linear classifiers
This calculation can be implemented sequentially:
1. Compute y_1 and y_2 from x_1 and x_2.
2. Compute the decision f( y_1 - y_2 - 1/2 ) from y_1 and y_2.
Each layer in the sequence consists of one or more linear classifications. This is therefore a two-layer perceptron.
The Two-Layer Perceptron

x_1  x_2 | y_1   y_2  | Output
 0    0  | 0(-)  0(-) |  B(0)
 0    1  | 1(+)  0(-) |  A(1)
 1    0  | 1(+)  0(-) |  A(1)
 1    1  | 1(+)  1(+) |  B(0)

where y_1 = f(g_1(x)) and y_2 = f(g_2(x)) form Layer 1, and
f( y_1 - y_2 - 1/2 ) forms Layer 2.
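The table above can be verified directly. A minimal sketch in plain Python, using the unit step activation and the two discriminants from the slides:

```python
def f(a):
    """Unit step activation: 0 if a < 0, else 1."""
    return 0 if a < 0 else 1

def xor_net(x1, x2):
    # Layer 1: two linear classifiers
    y1 = f(x1 + x2 - 0.5)   # g1(x) = x1 + x2 - 1/2
    y2 = f(x1 + x2 - 1.5)   # g2(x) = x1 + x2 - 3/2
    # Layer 2: combine y1 and y2
    return f(y1 - y2 - 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```

Running this reproduces the XOR truth table, with output 1 for Class A and 0 for Class B.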
The Two-Layer Perceptron
The first layer performs a nonlinear mapping, y_1 = f(g_1(x)) and y_2 = f(g_2(x)), that makes the data linearly separable.
The Two-Layer Perceptron Architecture
[Figure: input layer (x_1, x_2), hidden layer computing g_1(x) = x_1 + x_2 - 1/2 and g_2(x) = x_1 + x_2 - 3/2, output layer combining y_1 and y_2.]
The Two-Layer Perceptron
Note that the hidden layer maps the plane onto the vertices of the unit square.
Higher Dimensions
Each hidden unit realizes a hyperplane discriminant function. The output of each hidden unit is 0 or 1 depending upon the location of the input vector relative to the hyperplane:

x ∈ R^l  →  y = [y_1, ..., y_p]^T,  y_i ∈ {0, 1},  i = 1, ..., p
Higher Dimensions
Together, the hidden units map the input onto the vertices of a p-dimensional unit hypercube:

x ∈ R^l  →  y = [y_1, ..., y_p]^T,  y_i ∈ {0, 1},  i = 1, ..., p
Two-Layer Perceptron
These p hyperplanes partition the l-dimensional input space into polyhedral regions. Each region corresponds to a different vertex of the p-dimensional hypercube represented by the outputs of the hidden layer.
Two-Layer Perceptron
In this example, the vertex (0, 0, 1) corresponds to the region of the input space where:
g_1(x) < 0
g_2(x) < 0
g_3(x) > 0
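The vertex code for a region is simply the vector of step-function outputs. A small sketch in plain Python; the three discriminants g_1, g_2, g_3 here are hypothetical choices for illustration (the slide's original hyperplanes are not recoverable from the figure):

```python
def step(a):
    return 0 if a < 0 else 1

def vertex(x1, x2):
    """Map a 2-D input to its hypercube vertex via three
    illustrative hyperplane discriminants (assumed, not from the slides)."""
    g = [x1 - 1.0,         # g1(x) = x1 - 1
         x2 - 1.0,         # g2(x) = x2 - 1
         x1 + x2 - 0.5]    # g3(x) = x1 + x2 - 1/2
    return tuple(step(v) for v in g)

# A point with g1 < 0, g2 < 0, g3 > 0 lands on vertex (0, 0, 1):
print(vertex(0.3, 0.4))
```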
Limitations of a Two-Layer Perceptron
The output neuron realizes a hyperplane in the transformed space that partitions the hypercube vertices into two sets. Thus the two-layer perceptron can classify vectors into classes that consist of unions of polyhedral regions. But NOT ANY union: whether a given union is separable depends on the relative positions of the corresponding vertices. How can we solve this problem?
The Three-Layer Perceptron
Suppose that Class A consists of the union of K polyhedra in the input space.
Use K neurons in the 2nd hidden layer.
Train each to classify one Class A vertex as positive, the rest negative.
Now use an output neuron that implements the OR function.
The Three-Layer Perceptron
Thus the three-layer perceptron can separate classes resulting from any union of polyhedral regions in the input space.
The Three-Layer Perceptron
The first layer of the network forms the hyperplanes in the input space.
The second layer of the network forms the polyhedral regions of the input space.
The third layer forms the appropriate unions of these regions and maps each to the appropriate class.
Outline
Combining Linear Classifiers
Learning Parameters
Training Data
The training data consist of N input-output pairs (y(i), x(i)), i = 1, ..., N, where

y(i) = [y_1(i), ..., y_{k_L}(i)]^T  and  x(i) = [x_1(i), ..., x_{k_0}(i)]^T
Choosing an Activation Function
The unit step activation function means that the error rate of the network is a discontinuous function of the weights. This makes it difficult to learn optimal weights by minimizing the error. To fix this problem, we need to use a smooth activation function. A popular choice is the sigmoid function we used for logistic regression:
Smooth Activation Function

f(a) = 1 / (1 + exp(-a))

[Figure: the sigmoid applied to the linear input a = w^T φ(x).]
Output: Two Classes
For a binary classification problem, there is a single output node with activation function given by

f(a) = 1 / (1 + exp(-a))

Since the output is constrained to lie between 0 and 1, it can be interpreted as the probability of the input vector belonging to Class 1.
Output: K > 2 Classes
For a K-class problem, we use K outputs and the softmax function given by

y_k = exp(a_k) / Σ_j exp(a_j)

Since the outputs are constrained to lie between 0 and 1 and sum to 1, y_k can be interpreted as the probability that the input vector belongs to Class k.
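A quick sketch of the softmax output in plain Python, confirming the two properties that justify the probability interpretation (outputs in (0, 1), summing to 1); the activation values are arbitrary examples:

```python
import math

def softmax(a):
    """Softmax over a list of activations a_1, ..., a_K."""
    exps = [math.exp(ak) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]

y = softmax([2.0, 1.0, 0.1])
print(y, sum(y))
```

Note that the largest activation always receives the largest probability, since exp is monotonic.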
Non-Convex
Now each layer of our multi-layer perceptron is a logistic regressor. Recall that optimizing the weights in logistic regression results in a convex optimization problem. Unfortunately, the cascading of logistic regressors in the multi-layer perceptron makes the problem non-convex. This makes it difficult to determine an exact solution. Instead, we typically use gradient descent to find a locally optimal solution for the weights. The specific learning algorithm is called the backpropagation algorithm.
Nonlinear Classification and Regression: Outline
Multi-Layer Perceptrons
The Back-Propagation Learning Algorithm
Generalized Linear Models
Radial Basis Function Networks
Sparse Kernel Machines
  Nonlinear SVMs and the Kernel Trick
  Relevance Vector Machines
The Backpropagation Algorithm

Paul J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.

Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533-536.
Notation
Assume a network with L layers:
k_0 nodes in the input layer
k_r nodes in the rth layer
Notation
Let y_k^{r-1} be the output of the kth neuron of layer r-1.
Let w_{jk}^r be the weight of the synapse on the jth neuron of layer r from the kth neuron of layer r-1.
Input

y_k^0(i) = x_k(i),  k = 1, ..., k_0
Notation
Let v_j^r(i) be the total input to the jth neuron of layer r:

v_j^r(i) = (w_j^r)^T y^{r-1}(i) = Σ_{k=0}^{k_{r-1}} w_{jk}^r y_k^{r-1}(i)

where we define y_0^{r-1}(i) = +1 to incorporate the bias term. Then

y_j^r(i) = f( v_j^r(i) ) = f( Σ_{k=0}^{k_{r-1}} w_{jk}^r y_k^{r-1}(i) )
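The layer recursion above can be sketched in a few lines of plain Python; the weight values in the usage line are arbitrary examples, and the bias is carried by prepending y_0 = +1 exactly as in the definition:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(W, y_prev):
    """One layer of the recursion: v_j = (w_j)^T y_prev, y_j = f(v_j).
    W is a list of weight vectors w_j with the bias weight first;
    y_prev holds the previous layer's outputs (without the bias entry)."""
    y_aug = [1.0] + y_prev  # prepend y_0 = +1 for the bias term
    return [sigmoid(sum(wjk * yk for wjk, yk in zip(wj, y_aug)))
            for wj in W]

# One neuron with bias 0 and weights (1, -1): v = 0.5 - 0.5 = 0, y = 0.5
print(layer_forward([[0.0, 1.0, -1.0]], [0.5, 0.5]))
```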
Cost Function
A common cost function is the squared error:

J = Σ_{i=1}^N ε(i)

where

ε(i) = (1/2) Σ_{m=1}^{k_L} e_m(i)^2 = (1/2) Σ_{m=1}^{k_L} ( ŷ_m(i) - y_m(i) )^2

and ŷ_m(i) = y_m^L(i) is the output of the network.
Cost Function
To summarize, the error for input i is given by

ε(i) = (1/2) Σ_{m=1}^{k_L} e_m(i)^2 = (1/2) Σ_{m=1}^{k_L} ( ŷ_m(i) - y_m(i) )^2

where ŷ_m(i) = y_m^L(i) is the output of the output layer, and each layer is related to the previous layer through

y_j^r(i) = f( v_j^r(i) )  and  v_j^r(i) = (w_j^r)^T y^{r-1}(i)
Gradient Descent
Gradient descent starts with an initial guess at the weights over all layers of the network. We then use these weights to compute the network output ŷ(i) for each input vector x(i) in the training data. This allows us to calculate the error

ε(i) = (1/2) Σ_{m=1}^{k_L} e_m(i)^2 = (1/2) Σ_{m=1}^{k_L} ( ŷ_m(i) - y_m(i) )^2

for each of these inputs. Then, in order to minimize this error, we incrementally update the weights in the negative gradient direction:

w_j^r(new) = w_j^r(old) - μ ∂J/∂w_j^r = w_j^r(old) - μ Σ_{i=1}^N ∂ε(i)/∂w_j^r
Gradient Descent
Since v_j^r(i) = (w_j^r)^T y^{r-1}(i), the influence of the jth weight vector of the rth layer on the error can be expressed as:

∂ε(i)/∂w_j^r = ( ∂ε(i)/∂v_j^r(i) ) ( ∂v_j^r(i)/∂w_j^r ) = δ_j^r(i) y^{r-1}(i)

where δ_j^r(i) ≡ ∂ε(i)/∂v_j^r(i)
Gradient Descent

∂ε(i)/∂w_j^r = δ_j^r(i) y^{r-1}(i),  where δ_j^r(i) ≡ ∂ε(i)/∂v_j^r(i)

For an intermediate layer, we cannot compute δ_j^r(i) directly. However, δ_j^r(i) can be computed inductively, starting from the output layer.
Backpropagation: The Output Layer

∂ε(i)/∂w_j^r = δ_j^r(i) y^{r-1}(i),  where δ_j^r(i) ≡ ∂ε(i)/∂v_j^r(i)

and

ε(i) = (1/2) Σ_{m=1}^{k_L} e_m(i)^2 = (1/2) Σ_{m=1}^{k_L} ( ŷ_m(i) - y_m(i) )^2

Recall that ŷ_j(i) = y_j^L(i) = f( v_j^L(i) ). Thus at the output layer we have

δ_j^L(i) = ∂ε(i)/∂v_j^L(i) = ( ∂ε(i)/∂e_j^L(i) ) ( ∂e_j^L(i)/∂v_j^L(i) ) = e_j^L(i) f'( v_j^L(i) )

For the sigmoid f(a) = 1 / (1 + exp(-a)), we have f'(a) = f(a)(1 - f(a)), so

δ_j^L(i) = e_j^L(i) f( v_j^L(i) ) ( 1 - f( v_j^L(i) ) )
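The sigmoid identity f'(a) = f(a)(1 - f(a)) used in the output-layer delta can be checked numerically with a central finite difference; the test point a = 0.7 is an arbitrary choice:

```python
import math

def f(a):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-a))

def fprime_closed(a):
    """Closed-form derivative: f(a)(1 - f(a))."""
    return f(a) * (1.0 - f(a))

def fprime_numeric(a, h=1e-6):
    """Central-difference approximation of f'(a)."""
    return (f(a + h) - f(a - h)) / (2 * h)

a = 0.7
print(fprime_closed(a), fprime_numeric(a))
```

The two values agree to many decimal places, confirming the identity that lets backpropagation reuse the forward-pass outputs when computing the deltas.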
Backpropagation: Hidden Layers
Observe that the dependence of the error on the total input to a neuron in a previous layer can be expressed in terms of the dependence on the total inputs of neurons in the following layer:

δ_j^{r-1}(i) = ∂ε(i)/∂v_j^{r-1}(i) = Σ_{k=1}^{k_r} ( ∂ε(i)/∂v_k^r(i) ) ( ∂v_k^r(i)/∂v_j^{r-1}(i) ) = Σ_{k=1}^{k_r} δ_k^r(i) ∂v_k^r(i)/∂v_j^{r-1}(i)

where

v_k^r(i) = Σ_{m=0}^{k_{r-1}} w_{km}^r y_m^{r-1}(i) = Σ_{m=0}^{k_{r-1}} w_{km}^r f( v_m^{r-1}(i) )

Thus we have

∂v_k^r(i)/∂v_j^{r-1}(i) = w_{kj}^r f'( v_j^{r-1}(i) )

and so

δ_j^{r-1}(i) = f'( v_j^{r-1}(i) ) Σ_{k=1}^{k_r} δ_k^r(i) w_{kj}^r = f( v_j^{r-1}(i) ) ( 1 - f( v_j^{r-1}(i) ) ) Σ_{k=1}^{k_r} δ_k^r(i) w_{kj}^r

Thus once the δ_k^r(i) are determined, they can be propagated backward to calculate δ_j^{r-1}(i) using this inductive formula.
Backpropagation: Summary of Algorithm

1. Initialization: Initialize all weights with small random values.

2. Forward Pass: For each input vector, run the network in the forward direction, calculating:

   v_j^r(i) = (w_j^r)^T y^{r-1}(i);   y_j^r(i) = f( v_j^r(i) );
   and finally ε(i) = (1/2) Σ_{m=1}^{k_L} e_m(i)^2 = (1/2) Σ_{m=1}^{k_L} ( ŷ_m(i) - y_m(i) )^2

3. Backward Pass: Starting with the output layer, use the inductive formula to compute the δ_j^r(i):

   Output Layer (Base Case):  δ_j^L(i) = e_j^L(i) f'( v_j^L(i) )
   Hidden Layers (Inductive Case):  δ_j^{r-1}(i) = f'( v_j^{r-1}(i) ) Σ_{k=1}^{k_r} δ_k^r(i) w_{kj}^r

4. Update Weights:

   w_j^r(new) = w_j^r(old) - μ Σ_{i=1}^N ∂ε(i)/∂w_j^r,  where ∂ε(i)/∂w_j^r = δ_j^r(i) y^{r-1}(i)

Repeat until convergence.
Batch vs Online Learning
As described, on each iteration backpropagation updates the weights based upon all of the training data:

w_j^r(new) = w_j^r(old) - μ Σ_{i=1}^N ∂ε(i)/∂w_j^r,  where ∂ε(i)/∂w_j^r = δ_j^r(i) y^{r-1}(i)

This is called batch learning. An alternative is to update the weights after each training input has been processed by the network, based only upon the error for that input:

w_j^r(new) = w_j^r(old) - μ ∂ε(i)/∂w_j^r

This is called online learning.
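The difference between the two update schemes is easiest to see on a toy problem. A sketch in plain Python, using a single-parameter least-squares model rather than the full network (the data, learning rate, and epoch counts are illustrative assumptions):

```python
import random

random.seed(1)
X = [1.0, 2.0, 3.0]
T = [2.0, 4.0, 6.0]  # consistent with the true parameter w = 2

def batch_train(w, mu=0.05, epochs=100):
    """Batch: one update per epoch, gradient summed over all inputs."""
    for _ in range(epochs):
        grad = sum((w * x - t) * x for x, t in zip(X, T))
        w -= mu * grad
    return w

def online_train(w, mu=0.05, epochs=100):
    """Online: one update per training input, in shuffled order."""
    for _ in range(epochs):
        order = list(range(len(X)))
        random.shuffle(order)  # reshuffle the presentation order each epoch
        for i in order:
            w -= mu * (w * X[i] - T[i]) * X[i]
    return w

print(batch_train(0.0), online_train(0.0))
```

Both schemes recover w = 2 here; on the non-convex perceptron objective, the per-input noise of the online updates is what can help escape poor local minima.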
Batch vs Online Learning
One advantage of batch learning is that averaging over all inputs when updating the weights should lead to smoother convergence. On the other hand, the randomness associated with online learning might help to prevent convergence toward a local minimum. Changing the order of presentation of the inputs from epoch to epoch may also improve results.
Remarks
Local Minima: The objective function is in general non-convex, and so the solution may not be globally optimal.
Stopping Criterion: Typically we stop when the change in the weights or the change in the error function falls below a threshold.
Learning Rate: The speed and reliability of convergence depend on the learning rate μ.