Machine Learning CS-57A: Artificial Neural Networks
Burchan (bourch-khan) Bayazit
http://www.cse.wustl.edu/~bayazt/courses/cs57a/
Mailing list: cs-57a@cse.wustl.edu

Artificial Neural Networks (ANN)
Neural networks are inspired by biological nervous systems, such as our brain.
Useful for learning real-valued, discrete-valued, or vector-valued functions.
Applied to problems such as interpreting visual scenes, speech recognition, and learning robot control strategies.
Works well with noisy, complex sensor data, such as inputs from cameras and microphones.

ANN Inspiration from Neurobiology
A neuron: a many-inputs / one-output unit.
The cell body is 5-10 microns in diameter.
Incoming signals from other neurons determine whether the neuron shall excite ("fire").
The axon turns the processed inputs into outputs.
Synapses are the electrochemical contacts between neurons.

ANN
In the human brain, approximately 10^11 neurons are densely interconnected.
They are arranged in networks.
Each neuron is connected to about 10^4 others on average.
Fastest neuron switching time: about 10^-3 seconds.
ANNs are motivated by biological neuron systems; however, many of their features are inconsistent with biological systems.

ANN Short History
McCulloch & Pitts (1943) are generally recognized as the designers of the first neural network.
Their ideas, such as thresholds and many simple units combining to give increased computational power, are still in use today.
In the 1950s and 60s, many researchers worked on the perceptron.
In 1969, Minsky and Papert showed that perceptrons were limited, so neural network research died down for about 15 years.
In the mid-1980s, interest revived (Parker and LeCun).
ANN Types
Hopfield Network, One-Layer Perceptron, Two-Layer Perceptron

Neural Network Architectures
Many kinds of structures exist; the main distinction is between two classes:
a) Feed-forward: a directed acyclic graph (DAG); links are unidirectional, with no cycles. There is no internal state other than the weights.
b) Recurrent: links form arbitrary topologies, e.g., Hopfield networks and Boltzmann machines.
Recurrent networks can be unstable, oscillate, or exhibit chaotic behavior; e.g., given some input values, they can take a long time to compute a stable output, and learning is made more difficult. However, they can implement more complex agent designs and can model systems with state.

Perceptron
Inputs x_1, ..., x_n with weights w_1, ..., w_n feed a single unit whose output is
  o = 1 if sum_{i=1}^{n} w_i x_i > 0
  o = 0 otherwise

The McCulloch-Pitts model
The perceptron calculates a weighted sum of the inputs and compares it to a threshold. If the sum is higher than the threshold, the output is set to 1, otherwise to -1. Learning is finding the weights w_i.

Activation functions for units
Step function (Linear Threshold Unit): step(x) = 1 if x >= threshold, 0 if x < threshold
Sign function: sign(x) = +1 if x >= 0, -1 if x < 0
Sigmoid function: sigmoid(x) = 1/(1+e^(-x))
Adding an extra input with activation a_0 = -1 and weight w_0 = t is equivalent to having a threshold at t. This way we can always assume a 0 threshold.

Perceptron Mathematical Representation
  o(x_1, x_2, ..., x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, and -1 otherwise
In vector form: o(x) = sgn(w . x), where sgn(y) = 1 if y > 0 and -1 otherwise.
The hypothesis space is H = { w | w in R^(n+1) }.
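The threshold unit and the bias trick above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course; the function names and the example weights (0.6, 0.6) with threshold 1 are taken from the AND example later in the slides.

```python
import numpy as np

def sgn(y):
    # sign activation: +1 if y > 0, -1 otherwise, matching o(x) = sgn(w . x)
    return 1.0 if y > 0 else -1.0

def perceptron_output(weights, inputs, threshold):
    # Bias trick from the slides: add an extra input a0 = -1 with weight
    # w0 = t, which is equivalent to comparing the weighted sum against t.
    x = np.concatenate(([-1.0], np.asarray(inputs, dtype=float)))
    w = np.concatenate(([threshold], np.asarray(weights, dtype=float)))
    return sgn(np.dot(w, x))

# With w = (0.6, 0.6) and threshold t = 1 the unit computes logical AND:
# only (1,1) gives 0.6 + 0.6 - 1 = 0.2 > 0.
print(perceptron_output([0.6, 0.6], [1, 1], 1.0))  # 1.0
print(perceptron_output([0.6, 0.6], [0, 1], 1.0))  # -1.0
```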
Perceptron
o(x) defines an N-dimensional space and an (N-1)-dimensional hyperplane. The perceptron returns 1 for data points lying on one side of the hyperplane and -1 for data points lying on the other side.
If the positive and negative examples can be separated by a hyperplane, they are called linearly separable sets of examples. But this is not always the case.

Perceptron
The equation below describes a (hyper-)plane in the input space consisting of real-valued m-dimensional vectors. The plane splits the input space into two regions, each of them describing one class.
  Decision region for C1: w_1 x_1 + w_2 x_2 + w_0 >= 0
  Decision boundary between C1 and C2: w_1 x_1 + w_2 x_2 + w_0 = 0

Perceptron Learning
We have either -1 or +1 as the output, and the inputs are either 0 or 1. There are 4 cases:
1. The output is supposed to be + and the perceptron returns +
2. The output is supposed to be - and the perceptron returns -
3. The output is supposed to be + and the perceptron returns -
4. The output is supposed to be - and the perceptron returns +

Perceptron Learning
For each training example <x, t> in D:
  Find o = o(x)
  Update each weight: w_i = w_i + Δw_i
In Case 1 or 2, do nothing, since the perceptron returns the right result.
In Case 3, we need w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n > 0, so we need to increase the weights so that the left side of the equation will become greater than 0.
In Case 4, the weights must be decreased.
The following update rule satisfies all four cases:
  w_i ← w_i + Δw_i, where Δw_i = η (t - o) x_i
t is the target output, o is the output generated by the perceptron, and η is a positive constant known as the learning rate.

Learning the AND function
Training data (input_1, input_2, target): (0,1,0) (0,0,0) (1,0,0) (1,1,1)
Example solution weights: w_0 = -1, w_1 = 0.6, w_2 = 0.6
(Figure: the four points (0,0), (0,1), (1,0), (1,1) in the input_1 / input_2 plane.)
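The update rule Δw_i = η (t - o) x_i can be run on the AND training data above. A minimal sketch, assuming a 0/1 step unit to match the 0/1 targets in the slides; the epoch count and initial weight range are illustrative choices, not from the source.

```python
import numpy as np

def train_perceptron(data, eta=0.1, epochs=50):
    # data: list of (x1, x2, target) with 0/1 targets, trained with the
    # perceptron rule w_i <- w_i + eta * (t - o) * x_i.
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.05, 0.05, size=3)      # small random initial weights
    for _ in range(epochs):
        for x1, x2, t in data:
            x = np.array([1.0, x1, x2])            # constant input, so w[0] = -threshold
            o = 1.0 if np.dot(w, x) > 0 else 0.0   # step unit
            w += eta * (t - o) * x                 # update only on mistakes
    return w

# AND training data from the slides: (input_1, input_2, target)
and_data = [(0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 1)]
w = train_perceptron(and_data)
for x1, x2, t in and_data:
    o = 1.0 if np.dot(w, [1.0, x1, x2]) > 0 else 0.0
    print((x1, x2), "->", o)
```

Because AND is linearly separable, the perceptron convergence theorem guarantees the loop stops making mistakes after finitely many updates.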
Learning the AND function
Output space for the AND gate: the learned boundary w_0 + w_1*x_1 + w_2*x_2 = 0 separates (1,1) from (0,0), (0,1), and (1,0).
Training data (input_1, input_2, target): (0,1,0) (0,0,0) (1,0,0) (1,1,1)

Limitations of the Perceptron
Only binary input-output values.
Only two layers.
Separates the space linearly.

Learning XOR
Minsky and Papert (1969) showed that a two-layer perceptron cannot represent certain logical functions.
Some of these are very fundamental, in particular the exclusive or (XOR).
"Do you want coffee XOR tea?"
(Figure: the XOR points: (0,0) and (1,1) in one class, (0,1) and (1,0) in the other. They are not linearly separable.)

Solution to Linear Inseparability
Use another training rule (the delta rule).
Backpropagation.
ANN Gradient Descent and the Delta Rule
The delta rule is designed to converge even on examples that are not linearly separable.
It uses gradient descent to search the hypothesis space of possible weight vectors, to find the weights that best fit the training examples.

Gradient Descent
Define an error function based on the target concepts and the network output.
The goal is to change the weights so that the error is reduced: (w_1, w_2) → (w_1 + Δw_1, w_2 + Δw_2).

Gradient Descent
Training error of a hypothesis:
  E(w) = (1/2) Σ_{d in D} (t_d - o_d)^2
D is the set of training examples, t_d is the target output for training example d, and o_d is the output of the linear unit for training example d.

How to find Δw?
Derivation of the Gradient Descent Rule:
  ∇E(w) = [ ∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n ]
  w ← w + Δw, where Δw = -η ∇E(w)
∇E(w) is the direction of the steepest ascent along the error surface; the negative sign is present because we want to go in the direction that decreases E.
For the i-th component:
  w_i ← w_i + Δw_i, where Δw_i = -η ∂E/∂w_i

How to find Δw? (continued)
  ∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d in D} (t_d - o_d)^2
          = (1/2) Σ_{d in D} 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
          = Σ_{d in D} (t_d - o_d) ∂/∂w_i (t_d - w . x_d)
          = Σ_{d in D} (t_d - o_d) (-x_id)
Hence
  Δw_i = η Σ_{d in D} (t_d - o_d) x_id
where x_id is the single input component x_i for training example d.

Gradient-Descent Algorithm
Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value; η is the learning rate (e.g. 0.5).
  Initialize each w_i to some small random value.
  Until the termination condition is met, do:
    Initialize each Δw_i to zero.
    For each <x, t> in the training examples, do:
      Input the instance x to the unit and compute the output o.
      For each linear unit weight w_i, do: Δw_i ← Δw_i + η (t - o) x_i
    For each linear unit weight w_i, do: w_i ← w_i + Δw_i
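The batch algorithm above can be written compactly with NumPy. A minimal sketch, not from the course: the toy data set (targets generated by t = 1 + 2x) and the learning rate are hypothetical choices for illustration.

```python
import numpy as np

def delta_rule_batch(X, t, eta=0.05, epochs=200):
    # Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2 for a
    # linear unit o = w . x; bias handled by a constant 1 prepended to x.
    X = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                  # outputs for all training examples at once
        w += eta * X.T @ (t - o)   # dw_i = eta * sum_d (t_d - o_d) * x_id
    return w

# Hypothetical toy data: targets generated by t = 1 + 2*x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
w = delta_rule_batch(X, t)
print(np.round(w, 2))  # close to [1. 2.]
```

Note the accumulation over all of D before the weights change: that is the "offline/batch" strategy described in the next slide, as opposed to updating after every sample.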
Training Strategies
Online training: update the weights after each sample.
Offline (batch) training: compute the error over all samples, then update the weights.
Online training is noisy and sensitive to individual instances; however, it may escape local minima.

Example: Learning addition
Goal: learn binary addition, i.e., 0+0=00, 0+1=01, 1+0=01, 1+1=10.
Network: an input layer (two units), a hidden layer, and an output layer (two units: carry and sum).
Activation function: sigmoid.
Training data:
  Inputs:          (0,0) (0,1) (1,0) (1,1)
  Target concepts: (0,0) (0,1) (0,1) (1,0)

Example: Learning addition
First find the outputs o_1, o_2. In order to do this, propagate the inputs forward:
  First find the outputs for the neurons of the hidden layer.
  Then find the outputs of the neurons of the output layer.
(Figures: the network with example weights at each step.)
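The forward pass just described can be sketched as two matrix-vector products with a sigmoid after each. The weight values below are illustrative placeholders, not the ones on the slides (which are not legible in this transcript); each row of a weight matrix is one neuron, with the bias weight in the last column.

```python
import numpy as np

def sigmoid(z):
    # sigmoid activation used by the hidden and output units
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, W_output):
    # Propagate the inputs forward: first the hidden layer, then the
    # output layer. A constant 1 is appended for the bias weight.
    h = sigmoid(W_hidden @ np.append(x, 1.0))   # hidden-layer outputs
    o = sigmoid(W_output @ np.append(h, 1.0))   # output-layer outputs (carry, sum)
    return h, o

# Placeholder weights for a 2-2-2 network.
W_hidden = np.array([[ 0.5, -0.4,  0.1],
                     [-0.3,  0.8,  0.2]])
W_output = np.array([[ 0.7,  0.6, -0.1],
                     [-0.2,  0.5,  0.3]])
h, o = forward(np.array([1.0, 0.0]), W_hidden, W_output)
print(np.round(h, 3), np.round(o, 3))
```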
Example: Learning addition
Now propagate the errors back. In order to do that, first find the errors for the output layer, and also update the weights between the hidden layer and the output layer.

Example: Learning addition
Then backpropagate the errors to the hidden layer.

Example: Learning addition
Finally, update the weights!

Importance of the Learning Rate
Generalization of Backpropagation
(Figures: training curves for different learning rates.)
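The three steps above (output-layer errors, backpropagated hidden errors, weight updates) can be combined into one training loop for the binary-addition task. A sketch under stated assumptions: the slides do not specify the hidden-layer size, learning rate, or epoch count, so the 3 hidden units, η = 0.5, and 5000 epochs here are hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary addition: inputs (x1, x2), targets (carry, sum).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0, 0], [0, 1], [0, 1], [1, 0]], dtype=float)

def train_adder(eta=0.5, epochs=5000, seed=1):
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (3, 3))  # input -> hidden (bias in last column)
    W2 = rng.uniform(-0.5, 0.5, (2, 4))  # hidden -> output
    for _ in range(epochs):
        for x, t in zip(X, T):
            xb = np.append(x, 1.0)
            h = sigmoid(W1 @ xb)                     # forward: hidden layer
            hb = np.append(h, 1.0)
            o = sigmoid(W2 @ hb)                     # forward: output layer
            d_o = o * (1 - o) * (t - o)              # output-layer error terms
            d_h = h * (1 - h) * (W2[:, :3].T @ d_o)  # backpropagated hidden errors
            W2 += eta * np.outer(d_o, hb)            # update hidden -> output weights
            W1 += eta * np.outer(d_h, xb)            # update input -> hidden weights
    return W1, W2

def total_error(W1, W2):
    # E = 1/2 * sum over examples and outputs of (t - o)^2
    err = 0.0
    for x, t in zip(X, T):
        h = sigmoid(W1 @ np.append(x, 1.0))
        o = sigmoid(W2 @ np.append(h, 1.0))
        err += 0.5 * np.sum((t - o) ** 2)
    return err

W1, W2 = train_adder()
print("error after training:", round(total_error(W1, W2), 4))
```

The sum bit is exactly XOR, so this network needs the hidden layer: no single-layer perceptron could learn it, which is the point of the whole example.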
Backpropagation Using Gradient Descent
Advantages:
  Relatively simple implementation.
  Standard method that generally works well.
Disadvantages:
  Slow and inefficient.
  Can get stuck in local minima, resulting in sub-optimal solutions.

Local Minima
(Figure: an error surface with a local minimum and the global minimum.)

Alternatives to Gradient Descent
Simulated Annealing
  Advantages: can guarantee an optimal solution (global minimum).
  Disadvantages: may be slower than gradient descent; much more complicated implementation.

Alternatives to Gradient Descent
Genetic Algorithms / Evolutionary Strategies
  Advantages: faster than simulated annealing; less likely to get stuck in local minima.
  Disadvantages: slower than gradient descent; memory-intensive for large nets.

Alternatives to Gradient Descent
Simplex Algorithm
  Advantages: similar to gradient descent but faster; easy to implement.
  Disadvantages: does not guarantee a global minimum.

Enhancements to Gradient Descent
Momentum
  Adds a percentage of the last movement to the current movement.
Enhancements to Gradient Descent
Momentum
  Useful for getting over small bumps in the error function.
  Often finds a minimum in fewer steps.
  Δw_ij(t) = -η δ_j x_ij + α Δw_ij(t-1)

Backpropagation Drawbacks
Slow convergence. Improve by increasing the learning rate?
(Figure: error-surface contours illustrating the effect of the learning rate.)

Bias
Hard to characterize; amounts to smooth interpolation between data points.

Overfitting
Use a validation set; keep the weights that give the most accurate learning.
Decay the weights.
Use several networks and use voting.
K-fold cross validation:
1. Divide the input set into K small sets.
2. For k = 1..K:
3.   Use set k as the validation set and the remaining sets as the training set.
4.   Find the number of iterations i_k to optimal learning for this set.
5. Find the average number of iterations over all sets.
6. Train the network with that number of iterations.

Despite its popularity, backpropagation has some disadvantages:
  Learning is slow.
  New learning will rapidly overwrite old representations, unless these are interleaved (i.e., repeated) with the new patterns.
  This makes it hard to keep networks up-to-date with new information (e.g., a dollar exchange rate).
  This also makes it very implausible as a psychological model of human memory.
Good points:
  Easy to use: few parameters to set, and the algorithm is easy to implement.
  Can be applied to a wide range of data.
  It is very popular and has contributed greatly to the new connectionism (second wave).
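The momentum update Δw(t) = -η * gradient + α * Δw(t-1) can be demonstrated on a one-dimensional error surface. A minimal sketch: the quadratic f(w) = (w - 3)^2 and the values of η and α are illustrative, not from the slides.

```python
def minimize_with_momentum(eta=0.1, alpha=0.9, steps=300):
    # Gradient descent with momentum on f(w) = (w - 3)^2, minimum at w = 3.
    w, delta = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)               # gradient of f at the current w
        delta = -eta * grad + alpha * delta  # momentum: keep a fraction of the last move
        w += delta
    return w

print(round(minimize_with_momentum(), 3))
```

With α = 0 this reduces to plain gradient descent; the α * Δw(t-1) term lets the step carry speed through small bumps and shallow plateaus, which is exactly the benefit claimed on the slide.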
Deficiencies of BP Nets
Learning often takes a long time to converge; complex functions often need hundreds or thousands of epochs.
The net is essentially a black box. It may provide a desired mapping between input and output vectors (x, y), but it does not have the information of why a particular x is mapped to a particular y. It thus cannot provide an intuitive (e.g., causal) explanation for the computed result. This is because the hidden units and the learned weights do not have a semantics. What can be learned are operational parameters, not general, abstract knowledge of a domain.
The gradient descent approach only guarantees to reduce the total error to a local minimum (E may not be reduced to zero), and it cannot escape from the local-minimum error state. Hence not every function that is representable can be learned. How bad this is depends on the shape of the error surface: too many valleys/wells make it easy to be trapped in local minima. Possible remedies:
  Try nets with different numbers of hidden layers and hidden units (they may lead to different error surfaces, some of which might be better than others).
  Try different initial weights (different starting points on the surface).
  Force escape from local minima by random perturbation (e.g., simulated annealing).
Generalization is not guaranteed even if the error is reduced to zero. Over-fitting/over-training problem: the trained net fits the training samples perfectly (E reduced to 0) but does not give accurate outputs for inputs not in the training set.
Unlike many statistical methods, there is no theoretically well-founded way to assess the quality of BP learning: what confidence level can one have in a trained BP net with final error E (which may or may not be close to zero)?

Kohonen Networks
Every neuron of the output layer is connected with every neuron of the input layer.
While learning, the neuron closest to the input data (the one for which the distance between its weights and the input vector is minimum) and its neighbors (see below) update their weights.
For each training example:
  Find the winner neuron.
  Update the weights of the winner and its neighbors.
Kohonen Networks
The distance between an input x and a unit's weight vector w_j is the Euclidean distance:
  d_j = || x - w_j ||
The Kohonen update formula tends to bring the connections closer to the input data:
  w_j ← w_j + η (x - w_j)

Kohonen Maps
The input x is given to all the units at the same time.
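The winner-plus-neighbors rule can be sketched for a one-dimensional map. This is an illustrative implementation, not the course's: the grid size, neighborhood radius, and learning rate are hypothetical, and a real Kohonen map would usually shrink η and the radius over time.

```python
import numpy as np

def train_som(data, n_units=10, eta=0.3, radius=1, epochs=20, seed=0):
    # 1-D Kohonen map: units sit on a line. For each input, the winner is the
    # unit whose weight vector is closest (Euclidean distance) to the input;
    # the winner and its grid neighbors within `radius` move toward the input.
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.0, 1.0, (n_units, data.shape[1]))
    for _ in range(epochs):
        for x in data:
            winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))
            lo, hi = max(0, winner - radius), min(n_units, winner + radius + 1)
            for j in range(lo, hi):
                W[j] += eta * (x - W[j])   # w_j <- w_j + eta * (x - w_j)
    return W

def quantization_error(W, data):
    # average distance from each input to its closest unit
    return float(np.mean([np.min(np.linalg.norm(W - x, axis=1)) for x in data]))

# Hypothetical 2-D inputs clustered in two corners of the unit square.
data = np.array([[0.1, 0.1], [0.15, 0.2], [0.9, 0.85], [0.8, 0.9]])
W = train_som(data)
print("quantization error:", round(quantization_error(W, data), 4))
```

Because neighbors move together, nearby units end up with similar weights, which is what makes the map topology-preserving.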
Kohonen Maps
The weights of the winner unit are updated together with the weights of its neighborhood.

NETtalk (Sejnowski & Rosenberg, 1987): A Killer Application
The task is to learn to pronounce English text from examples.
Training data is 1024 words from a side-by-side English/phoneme source.
Input: 7 consecutive characters from written text, presented in a moving window that scans the text.
Output: a phoneme code giving the pronunciation of the letter at the center of the input window.
Network topology: 7x29 inputs (26 characters + punctuation marks), 80 hidden units, and 26 output units (phoneme code). Sigmoid units in the hidden and output layers.

NETtalk (cont'd.)
Training protocol: 95% accuracy on the training set after 50 epochs of training by full gradient descent; 78% accuracy on a set-aside test set.
Comparison against DECtalk (a rule-based expert system): DECtalk performs better, but it represents a decade of analysis by linguists. NETtalk learns from examples alone and was constructed with little knowledge of the task.

Steering an Automobile
ALVINN system [Pomerleau 1991, 1993]
Uses an artificial neural network.
Used a 30x32 TV image as input (960 input nodes), 5 hidden nodes, and 30 output nodes.
Training regime: modified on-the-fly. A human driver drives the car, and his actual steering angles are taken as correct labels for the corresponding inputs. Shifted and rotated images were also used for training.
ALVINN has driven for 90 consecutive kilometers at speeds up to 100 km/h.

Steering an Automobile: the ALVINN network
(Figure: the ALVINN network architecture.)

Voice Recognition
Task: learn to discriminate between two different voices saying "Hello".
Data sources: Steve Simpson, David Raubenheimer.
Format: frequency distribution (60 bins).
Network architecture
Feed-forward network:
  60 inputs (one for each frequency bin)
  6 hidden units
  2 outputs (0-1 for "Steve", 1-0 for "David")

Presenting the data
(Figure: example frequency distributions for Steve and David.)