Perceptron. Inner-product scalar Perceptron. XOR problem. Gradient descent. Stochastic approximation to gradient descent. 5/10/10


Perceptron. Inner-product scalar. Perceptron. Perceptron learning rule. XOR problem: linearly separable patterns. Gradient descent. Stochastic approximation to gradient descent: LMS, Adaline. 1

Inner product: net = ⟨w, x⟩ = ‖w‖ ‖x‖ cos(θ) = Σ_{i=1}^{n} w_i x_i. A measure of the projection of one vector onto another. Example. 2
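
As a quick numeric check of the two expressions for net above, a minimal NumPy sketch; the vectors w and x are made-up illustrative values:

```python
import numpy as np

w = np.array([2.0, 1.0])   # weight vector (illustrative)
x = np.array([1.0, 3.0])   # input vector (illustrative)

# net as the componentwise sum  sum_i w_i * x_i
net_sum = np.dot(w, x)

# net as ||w|| ||x|| cos(theta)
cos_theta = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
net_geom = np.linalg.norm(w) * np.linalg.norm(x) * cos_theta

print(net_sum, net_geom)   # both are approximately 5.0
```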

Activation function: o = f(net) = f(Σ_{i=1}^{n} w_i x_i). f(x) := sgn(x) = 1 if x ≥ 0, −1 if x < 0. f(x) := φ(x) = 1 if x ≥ 0, 0 if x < 0. f(x) := φ(x) = 1 if x ≥ 0.5, x if −0.5 < x < 0.5, 0 if x ≤ −0.5. Sigmoid function: f(x) := σ(x) = 1 / (1 + e^(−ax)). 3
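
A small Python sketch of the four activation functions listed above; the middle branch of the piecewise-linear unit and the slope parameter a follow the slide as transcribed, so treat the exact ramp shape as an assumption:

```python
import numpy as np

def sgn(x):
    """Threshold activation: +1 for x >= 0, -1 otherwise."""
    return 1.0 if x >= 0 else -1.0

def step(x):
    """Binary step: 1 for x >= 0, 0 otherwise."""
    return 1.0 if x >= 0 else 0.0

def piecewise_linear(x):
    """Ramp unit: saturates at 1 above 0.5 and at 0 below -0.5."""
    if x >= 0.5:
        return 1.0
    if x <= -0.5:
        return 0.0
    return x   # middle branch as written on the slide (assumption)

def sigmoid(x, a=1.0):
    """Logistic sigmoid with slope parameter a."""
    return 1.0 / (1.0 + np.exp(-a * x))
```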

Graphical representation. Similarity to real neurons... 4

The brain: on the order of 10^10 neurons, 10^4 to 10^5 connections per neuron. Perceptron: linear threshold unit (LTU). [Diagram: inputs x_1, x_2, ..., x_n with weights w_1, w_2, ..., w_n, bias input x_0 = 1 with weight w_0, feeding a summation Σ and a threshold that produces the output o.] net = Σ_{i=0}^{n} w_i x_i; o = sgn(net) = 1 if net ≥ 0, −1 if net < 0. The bias: a constant term that does not depend on any input value. McCulloch-Pitts model of a neuron. 5

The goal of a perceptron is to correctly classify the set of patterns D = {x_1, x_2, ..., x_m} into one of the classes C_1 and C_2. The output for class C_1 is o = 1 and for C_2 it is o = −1. For n = 2: linearly separable patterns. o = sgn(Σ_{i=0}^{n} w_i x_i), with Σ_{i=0}^{n} w_i x_i > 0 for C_1 and Σ_{i=0}^{n} w_i x_i ≤ 0 for C_2. x_0 = 1 is the bias. 6
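
A minimal sketch of this decision rule in Python, with the bias folded in as x_0 = 1; the weight and input values are illustrative, not from the slides:

```python
import numpy as np

def perceptron_output(w, x):
    """o = sgn(sum_i w_i x_i), with x already including the bias input x_0 = 1."""
    net = np.dot(w, x)
    return 1 if net > 0 else -1   # class C1 if the weighted sum is positive, else C2

w = np.array([-1.0, 2.0, 1.0])    # [w0 (bias weight), w1, w2], illustrative
x = np.array([1.0, 0.5, 1.5])     # [x0 = 1, x1, x2]

print(perceptron_output(w, x))    # -> 1, i.e. class C1
```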

Perceptron learning rule. Consider linearly separable problems. How do we find appropriate weights? Check whether the output pattern o belongs to the desired class, i.e. has the desired value d. w_new = w_old + Δw, with Δw = η(d − o)x, where η is called the learning rate, 0 < η ≤ 1. In supervised learning the network has its output compared with known correct answers: supervised learning is learning with a teacher, and (d − o) plays the role of the error signal. 7
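
A minimal sketch of the perceptron learning rule on a toy linearly separable problem (logical AND); the data set, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=20):
    """Perceptron learning rule: w <- w + eta * (d - o) * x.
    X: inputs with a leading bias column of 1s; d: desired outputs in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            o = 1 if np.dot(w, x) >= 0 else -1    # current output
            w += eta * (target - o) * x           # no change when o == target
    return w

# Toy data: logical AND of x1, x2 in {0, 1}, with bias input x0 = 1 prepended
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1])

print(train_perceptron(X, d))   # a separating weight vector for AND
```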

Perceptron: the algorithm converges to the correct classification if the training data is linearly separable and η is sufficiently small. When assigning a value to η we must keep in mind two conflicting requirements: averaging of past inputs to provide stable weight estimates, which requires a small η; and fast adaptation with respect to real changes in the underlying distribution of the process responsible for the generation of the input vector x, which requires a large η. Several nodes: o_1 = sgn(Σ_{i=0}^{n} w_{1i} x_i), o_2 = sgn(Σ_{i=0}^{n} w_{2i} x_i). 8

o_j = sgn(Σ_{i=0}^{n} w_{ji} x_i). Constructions. 9
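
With several nodes the weights w_{ji} form a matrix with one row per output node, so all outputs can be computed in one step; a small sketch with illustrative values:

```python
import numpy as np

def layer_outputs(W, x):
    """o_j = sgn(sum_i W[j, i] * x_i) for every output node j at once."""
    return np.where(W @ x >= 0, 1, -1)

W = np.array([[-0.5, 1.0, 0.0],    # weights of node 1: w_10, w_11, w_12 (illustrative)
              [ 0.5, -1.0, 1.0]])  # weights of node 2: w_20, w_21, w_22
x = np.array([1.0, 0.8, 0.3])      # input with bias x_0 = 1

print(layer_outputs(W, x))         # -> [1 1]
```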

Frank Rosenblatt, 1928-1971. 10

Rosenblatt's bitter rival and professional nemesis was Marvin Minsky of MIT. Minsky despised Rosenblatt, hated the concept of the perceptron, and wrote several polemics against him. For years Minsky crusaded against Rosenblatt on a very nasty and personal level, including contacting every group who funded Rosenblatt's research to denounce him as a charlatan, hoping to ruin Rosenblatt professionally and to cut off all funding for his research in neural nets. The XOR problem and the perceptron: Minsky and Papert, mid-1960s. 12

Gradient descent. To understand it, consider a simpler linear unit, where o = Σ_{i=0}^{n} w_i x_i. Let's learn the w_i that minimize the squared error over D = {(x_0, t_0), (x_1, t_1), ..., (x_n, t_n)} (t for target); no bias any more. 13

Error for different hypotheses, plotted over w_0 and w_1 (dim 2). We want to move the weight vector in the direction that decreases E: w_i = w_i + Δw_i, i.e. w = w + Δw. 14

The gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n], and the training rule is Δw = −η ∇E[w], i.e. Δw_i = −η ∂E/∂w_i. Differentiating E = ½ Σ_{d∈D} (t_d − o_d)² gives ∂E/∂w_i = −Σ_{d∈D} (t_d − o_d) x_{id}. 15
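
For reference, a worked version of that differentiation, assuming the standard squared-error definition for the linear unit introduced above:

```latex
\begin{aligned}
E(\mathbf{w}) &= \tfrac{1}{2}\sum_{d\in D}\bigl(t_d - o_d\bigr)^2,
  \qquad o_d = \sum_{i=0}^{n} w_i x_{id} \\
\frac{\partial E}{\partial w_i}
  &= \tfrac{1}{2}\sum_{d\in D} 2\,(t_d - o_d)\,
     \frac{\partial}{\partial w_i}\bigl(t_d - o_d\bigr)
   = \sum_{d\in D} (t_d - o_d)\,(-x_{id}) \\
\Delta w_i &= -\eta\,\frac{\partial E}{\partial w_i}
            = \eta \sum_{d\in D} (t_d - o_d)\,x_{id}
\end{aligned}
```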

Update rule for gradient descent: Δw_i = η Σ_{d∈D} (t_d − o_d) x_{id}.

Gradient-Descent(training_examples, η): each training example is a pair of the form ⟨(x_1, ..., x_n), t⟩, where (x_1, ..., x_n) is the vector of input values and t is the target output value; η is the learning rate (e.g. 0.1).
- Initialize each w_i to some small random value
- Until the termination condition is met, do:
  - Initialize each Δw_i to zero
  - For each ⟨(x_1, ..., x_n), t⟩ in training_examples, do:
    - Input the instance (x_1, ..., x_n) to the linear unit and compute the output o
    - For each linear unit weight w_i: Δw_i = Δw_i + η (t − o) x_i
  - For each linear unit weight w_i: w_i = w_i + Δw_i
16
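
A runnable sketch of the Gradient-Descent procedure above for a linear unit; the toy data, the learning rate, and the fixed epoch count used as the termination condition are illustrative choices, not part of the slides:

```python
import numpy as np

def gradient_descent(X, t, eta=0.1, epochs=100):
    """Batch gradient descent for a linear unit o = sum_i w_i x_i.
    X: examples as rows, with a bias column of 1s included; t: target outputs."""
    w = np.random.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
    for _ in range(epochs):                          # termination: fixed number of passes
        delta_w = np.zeros_like(w)
        for x, target in zip(X, t):
            o = np.dot(w, x)                         # linear-unit output
            delta_w += eta * (target - o) * x        # accumulate over all examples in D
        w += delta_w                                 # one batch update per pass
    return w

# Toy data: t = 1 + 2*x1, with a bias column of 1s (illustrative)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, t))                        # approaches [1, 2]
```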

Stochastic approximation to gradient descent: Δw_i = η (t − o) x_i. The gradient descent training rule updates by summing over all the training examples in D; stochastic gradient descent approximates it by updating the weights incrementally, calculating the error for each example. Known as the delta rule or LMS (least mean squares) weight update; the Adaline rule, used for adaptive filters; Widrow and Hoff (1960). 17
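
The same linear unit trained with the incremental delta/LMS rule, updating the weights after every single example rather than after a full pass over D; the data and learning rate are again illustrative:

```python
import numpy as np

def lms_train(X, t, eta=0.05, epochs=100):
    """Delta / LMS rule: w <- w + eta * (t - o) * x after each example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = np.dot(w, x)              # linear-unit output for this example
            w += eta * (target - o) * x   # incremental (stochastic) weight update
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + x1
t = np.array([1.0, 3.0, 5.0, 7.0])                              # t = 1 + 2*x1
print(lms_train(X, t))                                           # close to [1, 2]
```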

LMS estimate of the weight vector: no steepest descent; no well-defined trajectory in weight space, instead a random trajectory (stochastic gradient descent); converges only asymptotically toward the minimum error; can approximate gradient descent arbitrarily closely if η is made small enough. Summary: the perceptron training rule is guaranteed to succeed if the training examples are linearly separable and the learning rate η is sufficiently small. The linear unit training rule uses gradient descent or LMS and is guaranteed to converge to the hypothesis with minimum squared error, given a sufficiently small learning rate η, even when the training data contains noise and even when the training data is not separable by H. 18

Inner-product scalar. Perceptron. Perceptron learning rule. XOR problem: linearly separable patterns. Gradient descent. Stochastic approximation to gradient descent: LMS, Adaline. XOR? Multi-layer networks: output layer, hidden layer, input layer. 19