Slide03. Haykin Chapter 3 (Chaps. 1 and 3, 3rd Ed.): Single-Layer Perceptrons
Instructor: Yoonsuck Choe (CPSC)

Historical Overview

- McCulloch and Pitts (1943): neural networks as computing machines.
- Hebb (1949): postulated the first rule for self-organizing learning.
- Rosenblatt (1958): the perceptron as a first model of supervised learning.
- Widrow and Hoff (1960): adaptive filters using the least-mean-square (LMS) algorithm (the delta rule).

Multiple Faces of a Single Neuron

What a single neuron does can be viewed from different perspectives:
- Adaptive filter: as in signal processing.
- Classifier: as in the perceptron.
The two aspects will be reviewed, in the above order.

Part I: Adaptive Filter
Adaptive Filtering Problem

Consider an unknown dynamical system that takes m inputs and generates one output. The behavior of the system is described by its input/output pairs:

  T : {x(i), d(i); i = 1, 2, ..., n, ...}

where x(i) = [x_1(i), x_2(i), ..., x_m(i)]^T is the input and d(i) the desired response (or target signal). The input vector can be either a spatial snapshot or a temporal sequence uniformly spaced in time.

There are two important processes in adaptive filtering:
- Filtering process: generation of output based on the input: y(i) = x^T(i) w(i).
- Adaptive process: automatic adjustment of the weights to reduce the error: e(i) = d(i) - y(i).

Unconstrained Optimization Techniques

How can we adjust w(i) to gradually minimize e(i)? Note that e(i) = d(i) - y(i) = d(i) - x^T(i) w(i). Since d(i) and x(i) are fixed, only a change in w(i) can change e(i). In other words, we want to minimize the cost function E(w) with respect to the weight vector w: find the optimal solution w*. The necessary condition for optimality is ∇E(w*) = 0, where the gradient operator is defined as

  ∇ = [∂/∂w_1, ∂/∂w_2, ..., ∂/∂w_m]^T.

With this, we get

  ∇E(w) = [∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_m]^T.

Steepest Descent

We want the iterative update algorithm to have the following property: E(w(n+1)) < E(w(n)). Define the gradient vector ∇E(w) as g. The iterative weight update rule then becomes

  w(n+1) = w(n) - η g(n),

where η is a small learning-rate parameter. So we can write

  Δw(n) = w(n+1) - w(n) = -η g(n).

Steepest Descent (cont'd)

We now check that E(w(n+1)) < E(w(n)). Using a first-order Taylor expansion of E(·) near w(n),

  E(w(n+1)) ≈ E(w(n)) + g^T(n) Δw(n),

and Δw(n) = -η g(n), we get

  E(w(n+1)) ≈ E(w(n)) - η g^T(n) g(n) = E(w(n)) - η ||g(n)||².

The subtracted term is positive, so it is indeed the case (for small η) that E(w(n+1)) < E(w(n)).

Taylor series: f(x) = f(a) + f'(a)(x - a) + f''(a)(x - a)²/2! + ...
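The steepest-descent update w(n+1) = w(n) - η g(n) can be sketched in a few lines; the quadratic cost E(w) = w_1² + w_2² used here is an assumed toy example, not from the slides.

```python
# Minimal sketch of steepest descent, assuming the toy cost
# E(w) = w1^2 + w2^2, whose gradient is g(w) = [2*w1, 2*w2].
def gradient(w):
    return [2 * wi for wi in w]

def steepest_descent(w, eta=0.1, steps=100):
    # w(n+1) = w(n) - eta * g(n)
    for _ in range(steps):
        g = gradient(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

w = steepest_descent([3.0, -4.0])   # converges toward the minimum at the origin
```

With this cost, each step scales every coordinate by (1 - 2η), which also illustrates the instability remark below: for η > 1 the factor exceeds 1 in magnitude and the iterates diverge.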
Steepest Descent: Example

[Figure: trajectory of steepest descent on a quadratic cost surface.] Convergence to the optimal w* is very slow.
- Small η: overdamped, smooth trajectory.
- Large η: underdamped, jagged trajectory.
- η too large: the algorithm becomes unstable.

Steepest Descent: Another Example

[Figure: surface plot of x² + y² and its gradient field. The vector lengths were scaled down by a factor of 10 to avoid clutter.] For f(x) = f(x, y) = x² + y²,

  ∇f(x, y) = [∂f/∂x, ∂f/∂y]^T = [2x, 2y]^T.

Note that (1) the gradient vectors point upward, away from the origin, and (2) the vectors are shorter near the origin. If you follow -∇f(x, y), you will end up at the origin. We can also see that the gradient vectors are perpendicular to the level curves.

Newton's Method

Newton's method is an extension of steepest descent, where the second-order term in the Taylor series expansion is used. It is generally faster and shows less erratic meandering than steepest descent. There are certain conditions to be met, though, such as the Hessian matrix ∇²E(w) being positive definite (for an arbitrary x, x^T H x > 0).

Gauss-Newton Method

Applicable for cost functions expressed as sums of error squares:

  E(w) = (1/2) Σ_{i=1}^{n} e_i(w)²,

where e_i(w) is the error in the i-th trial, with the weight w. Recalling the Taylor series f(x) = f(a) + f'(a)(x - a) + ..., we can express e_i(w) evaluated near w_k as

  e_i(w) = e_i(w_k) + [∂e_i/∂w]^T_{w = w_k} (w - w_k).

In matrix notation, we get:

  e(w) = e(w_k) + J_e(w_k)(w - w_k).

* We will use a slightly different notation than the textbook, for clarity.
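For the quadratic f(x, y) = x² + y² from the example above, the Hessian is the constant matrix 2I, so a single Newton step w - H⁻¹g lands exactly on the minimum; this snippet assumes that specific cost purely for illustration.

```python
# One Newton step on f(x, y) = x^2 + y^2: the gradient is [2x, 2y], the
# Hessian is [[2, 0], [0, 2]], so H^{-1} g = [x, y] and the update
# w - H^{-1} g jumps straight to the origin.
def newton_step(w):
    g = [2 * w[0], 2 * w[1]]          # gradient
    h_inv = [[0.5, 0.0], [0.0, 0.5]]  # inverse Hessian (constant here)
    step = [h_inv[0][0] * g[0] + h_inv[0][1] * g[1],
            h_inv[1][0] * g[0] + h_inv[1][1] * g[1]]
    return [w[0] - step[0], w[1] - step[1]]

w = newton_step([3.0, -4.0])   # -> [0.0, 0.0] in one step
```

Compare with the steepest-descent example, which needs many small steps on the same cost: using curvature information is what makes Newton's method faster here.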
Gauss-Newton Method (cont'd)

J_e(w) is the Jacobian matrix, where each row is the (transposed) gradient of e_i(w):

  J_e(w) = [ ∂e_1/∂w_1  ∂e_1/∂w_2  ...  ∂e_1/∂w_m ]   [ (∇e_1(w))^T ]
           [ ∂e_2/∂w_1  ∂e_2/∂w_2  ...  ∂e_2/∂w_m ] = [ (∇e_2(w))^T ]
           [   ...         ...     ...     ...    ]   [     ...     ]
           [ ∂e_n/∂w_1  ∂e_n/∂w_2  ...  ∂e_n/∂w_m ]   [ (∇e_n(w))^T ]

Quick Example: Jacobian Matrix

Given

  e(x, y) = [e_1(x, y), e_2(x, y)]^T = [x² + y², cos(x) + sin(y)]^T,

the Jacobian of e(x, y) becomes

  J_e(x, y) = [ ∂e_1/∂x  ∂e_1/∂y ] = [   2x        2y   ]
              [ ∂e_2/∂x  ∂e_2/∂y ]   [ -sin(x)   cos(y) ]

We can then evaluate J_e(w_k) by plugging actual values of w_k into the Jacobian matrix above. For (x, y) = (0.5π, π), we get

  J_e(0.5π, π) = [     π          2π   ] = [  π   2π ]
                 [ -sin(0.5π)  cos(π) ]   [ -1   -1 ]

Gauss-Newton Method (cont'd)

Again, starting with e(w) = e(w_k) + J_e(w_k)(w - w_k), what we want is to set w so that the error approaches 0. That is, we want to minimize the squared norm of e(w):

  ||e(w)||² = ||e(w_k)||² + 2 e(w_k)^T J_e(w_k)(w - w_k) + (w - w_k)^T J_e^T(w_k) J_e(w_k)(w - w_k).

Differentiating the above w.r.t. w and setting the result to 0, we get

  J_e^T(w_k) e(w_k) + J_e^T(w_k) J_e(w_k)(w - w_k) = 0,

from which we get

  w = w_k - (J_e^T(w_k) J_e(w_k))^{-1} J_e^T(w_k) e(w_k).

* J_e^T(w_k) J_e(w_k) needs to be nonsingular (its inverse is needed).

Linear Least-Squares Filter

Given m inputs, an output function y(i) = φ(x_i^T w) with φ(x) = x (i.e., it is linear), and a set of training samples {x_i, d_i}_{i=1}^{n}, we can define the error vector for an arbitrary weight w as

  e(w) = d - [x_1, x_2, ..., x_n]^T w,

where d = [d_1, d_2, ..., d_n]^T. Setting X = [x_1, x_2, ..., x_n]^T, we get:

  e(w) = d - Xw.

Differentiating the above w.r.t. w, we get ∇e(w) = -X^T. So, the Jacobian becomes J_e(w) = (∇e(w))^T = -X. Plugging this into the Gauss-Newton equation, we finally get:

  w = w_k + (X^T X)^{-1} X^T (d - X w_k)
    = w_k + (X^T X)^{-1} X^T d - (X^T X)^{-1} X^T X w_k
    = (X^T X)^{-1} X^T d.

(The cancellation uses (X^T X)^{-1} X^T X w_k = I w_k = w_k.)
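The Jacobian quick example above can be checked numerically; the two error components e_1 = x² + y² and e_2 = cos(x) + sin(y) are the ones from the slide.

```python
import math

# Jacobian of e(x, y) = [x^2 + y^2, cos(x) + sin(y)]^T: each row is the
# gradient of one error component, as in the slide's quick example.
def jacobian(x, y):
    return [[2 * x,        2 * y],
            [-math.sin(x), math.cos(y)]]

J = jacobian(0.5 * math.pi, math.pi)   # -> [[pi, 2*pi], [-1, -1]] (approx.)
```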
Linear Least-Squares Filter (cont'd)

Points worth noting:
- X does not need to be a square matrix!
- We get w = (X^T X)^{-1} X^T d right off the bat partly because the output is linear (otherwise, the formula would be more complex).
- The Jacobian of the error function only depends on the input, and is invariant w.r.t. the weight w.
- The factor (X^T X)^{-1} X^T (let's call it X⁺) is like an inverse: multiply X⁺ on both sides of d = Xw and we get w = X⁺ d, since X⁺ X = I.

Linear Least-Squares Filter: Example

See src/pseudoinv.m:

  X = ceil(rand(4,2)*10)
  wtrue = rand(2,1)*10
  d = X*wtrue
  w = inv(X'*X)*X'*d

The recovered w equals wtrue.

Least-Mean-Square Algorithm

The cost function is based on instantaneous values:

  E(w) = (1/2) e²(w).

Differentiating the above w.r.t. w, we get

  ∂E(w)/∂w = e(w) ∂e(w)/∂w.

Plugging in e(w) = d - x^T w, we get ∂e(w)/∂w = -x, and hence

  ∂E(w)/∂w = -x e(w).

Using this in the steepest descent rule, we get the LMS algorithm:

  ŵ(n+1) = ŵ(n) + η x(n) e(n).

Note that this weight update is done with only one (x_i, d_i) pair!

Least-Mean-Square Algorithm: Evaluation

- The LMS algorithm behaves like a low-pass filter.
- The LMS algorithm is simple, model-independent, and thus robust.
- LMS does not follow the direction of steepest descent: instead, it follows it stochastically (stochastic gradient descent).
- Slow convergence is an issue.
- LMS is sensitive to the input correlation matrix's condition number (the ratio between the largest and smallest eigenvalues of the correlation matrix).
- LMS can be shown to converge if the learning rate satisfies

  0 < η < 2/λ_max,

  where λ_max is the largest eigenvalue of the correlation matrix.
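The LMS update ŵ(n+1) = ŵ(n) + η x(n) e(n) can be sketched directly; the two-weight linear target and the round-robin presentation order below are assumptions for illustration, not from the slides.

```python
# Minimal LMS sketch: one (x, d) pair per update, following the rule
# w <- w + eta * e * x, with e = d - x^T w.
def lms(samples, eta=0.1, epochs=200):
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, d in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))         # filtering process
            e = d - y                                        # error
            w = [wi + eta * e * xi for wi, xi in zip(w, x)]  # adaptive process
    return w

# Noise-free samples generated from an assumed true weight vector.
w_true = [2.0, -1.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
samples = [(x, x[0] * w_true[0] + x[1] * w_true[1]) for x in xs]
w = lms(samples)   # approaches w_true
```

Unlike the batch least-squares solution above, which uses all n samples at once, each LMS step touches a single sample, which is what makes it usable online.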
Improving Convergence in LMS

The main problem arises because of the fixed η. One solution: use a time-varying learning rate, η(n) = c/n, as in stochastic optimization theory. A better alternative: use a hybrid method called search-then-converge:

  η(n) = η₀ / (1 + n/τ).

Search-Then-Converge in LMS

When n < τ, performance is similar to standard LMS. When n > τ, it behaves like stochastic optimization. [Figure: η(n) = c/n vs. η(n) = η₀ / (1 + n/τ).]

Part II: Perceptron

The Perceptron Model

The perceptron uses a nonlinear neuron model (the McCulloch-Pitts model):

  v = Σ_{i=1}^{m} w_i x_i + b,    y = φ(v) = 1 if v > 0, 0 if v ≤ 0.

Goal: classify input vectors into one of two classes.
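The two learning-rate schedules compared above can be written out directly; the constants c = 1, η₀ = 0.1, and τ = 100 are illustrative choices, not values from the slides.

```python
# eta(n) = c/n (stochastic-approximation schedule) vs. the
# search-then-converge schedule eta(n) = eta0 / (1 + n/tau).
def eta_stochastic(n, c=1.0):
    return c / n

def eta_stc(n, eta0=0.1, tau=100.0):
    return eta0 / (1.0 + n / tau)

early = eta_stc(1)       # ~ eta0: behaves like standard fixed-rate LMS
late = eta_stc(100000)   # ~ eta0 * tau / n: behaves like the c/n schedule
```

The point of the hybrid is visible in the two regimes: for n much smaller than τ the rate stays near η₀ (search), while for n much larger than τ it decays like 1/n (converge).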
Boolean Logic Gates with Perceptron Units

[Figure (Russell & Norvig): three threshold units implementing AND (W1 = W2 = 1, t = 1.5), OR (W1 = W2 = 1, t = 0.5), and NOT (W1 = -1, t = -0.5).]

Perceptrons can represent basic Boolean functions. Thus, a network of perceptron units can compute any Boolean function. What about XOR or EQUIV?

What Perceptrons Can Represent

Perceptrons can only represent linearly separable functions. [Figure: a two-input threshold unit with inputs I_1, I_2, weights W_1, W_2, and threshold t.] Output of the perceptron: if W_1 I_1 + W_2 I_2 - t > 0, then the output is 1; otherwise the output is 0.

Geometric Interpretation

Rearranging W_1 I_1 + W_2 I_2 - t > 0 (output 1), we get (if W_2 > 0)

  I_2 > -(W_1/W_2) I_1 + t/W_2,

where for points above the line the output is 1, and for those below the line the output is 0. Compare with y = ax + b: the decision boundary is a line with slope -W_1/W_2 and intercept t/W_2.

The Role of the Bias

Without the bias (t = 0), learning is limited to adjusting the slope of a separating line that passes through the origin. [Figure: three example lines with different weights, all through the origin.]
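The three gates above can be written as threshold units, using the weights and thresholds shown in the figure:

```python
# Threshold unit: output 1 if sum_i w_i * x_i > theta, else 0.
def unit(ws, theta, xs):
    return 1 if sum(w * x for w, x in zip(ws, xs)) > theta else 0

def AND(a, b): return unit([1, 1], 1.5, [a, b])
def OR(a, b):  return unit([1, 1], 0.5, [a, b])
def NOT(a):    return unit([-1], -0.5, [a])
```

For example, AND(1, 1) gives 1 while AND(1, 0) gives 0, and NOT(0) gives 1 since 0 > -0.5.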
Limitation of Perceptrons

Only functions where the 0 points and the 1 points are clearly linearly separable can be represented by perceptrons. The geometric interpretation is generalizable to functions of n arguments, i.e., a perceptron with n inputs plus one threshold (or bias) unit.

Generalizing to n Dimensions

[Figure: a plane in 3D with normal vector n = [a, b, c]^T passing through (x₀, y₀, z₀); see http://mathworld.wolfram.com/Plane.html.] With n = (a, b, c), x = (x, y, z), and x₀ = (x₀, y₀, z₀), the equation of a plane is

  n · (x - x₀) = 0.

In short, ax + by + cz + d = 0, where a, b, c can serve as the weights and d = -n · x₀ as the bias. For an n-D input space, the decision boundary becomes an (n-1)-D hyperplane (one dimension less than the input space).

Linear Separability

[Figure: a linearly separable point set vs. two sets that are not linearly separable.] For functions that take integer or real values as arguments and output either 0 or 1. Left: linearly separable (i.e., we can draw a straight line between the classes). Right: not linearly separable (i.e., perceptrons cannot represent such a function).

Linear Separability (cont'd)

[Figure: the 2D input spaces of AND, OR, and XOR.] Perceptrons cannot represent XOR! Minsky and Papert (1969).
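A small brute-force search illustrates the separability gap: AND admits a single threshold unit, while no (w₁, w₂, θ) in the searched grid realizes XOR. The grid and step size are arbitrary choices for this sketch; the impossibility for all real weights is what Minsky and Papert actually proved.

```python
# Search a small weight/threshold grid for a single threshold unit
# computing a given 2-input Boolean function.
def realizable(table, grid):
    for w1 in grid:
        for w2 in grid:
            for theta in grid:
                if all((1 if w1 * a + w2 * b > theta else 0) == y
                       for (a, b), y in table.items()):
                    return True
    return False

grid = [k / 2 for k in range(-6, 7)]   # -3.0, -2.5, ..., 3.0
AND_TABLE = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
# realizable(AND_TABLE, grid) is True; realizable(XOR_TABLE, grid) is False
```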
XOR in Detail

  #   I_1  I_2  XOR
  1    0    0    0
  2    0    1    1
  3    1    0    1
  4    1    1    0

The unit outputs 1 iff W_1 I_1 + W_2 I_2 > t. Applying this to each row:

  (1)  0 ≤ t  (i.e., t ≥ 0)
  (2)  W_2 > t
  (3)  W_1 > t
  (4)  W_1 + W_2 ≤ t

From (2), (3), and (4), 2t < W_1 + W_2 ≤ t, so t < 0; but from (1), t ≥ 0: a contradiction.

Perceptrons: A Different Perspective

If w^T x > b, then the output is 1. Since w^T x = ||w|| ||x|| cos θ, the condition becomes

  ||x|| cos θ > b / ||w||.

So, if d = ||x|| cos θ in the figure above (the projection of x onto the direction of w) is greater than b/||w||, then the output is 1. Adjusting w changes the tilt of the decision boundary, and adjusting the bias b (and ||w||) moves the decision boundary closer to or away from the origin.

Perceptron Learning Rule

Given a linearly separable set of inputs that can belong to class C_1 or C_2, the goal of perceptron learning is to have

  w^T x > 0 for all inputs in class C_1,
  w^T x ≤ 0 for all inputs in class C_2.

Perceptron Learning Rule (cont'd)

If all inputs are correctly classified with the current weights w(n), i.e., w(n)^T x > 0 for all inputs in class C_1 and w(n)^T x ≤ 0 for all inputs in class C_2, then w(n+1) = w(n) (no change). Otherwise, adjust the weights. For misclassified inputs (η(n) is the learning rate):

  w(n+1) = w(n) - η(n) x(n)  if w^T x > 0 and x ∈ C_2,
  w(n+1) = w(n) + η(n) x(n)  if w^T x ≤ 0 and x ∈ C_1.

Or, simply, w(n+1) = w(n) + η(n) e(n) x(n), where e(n) = d(n) - y(n) (the error).
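The rule w(n+1) = w(n) + η e(n) x(n) can be run on a small linearly separable problem; training on AND with a constant bias input x[0] = 1 (equivalent to the -t term) is an assumed example, not from the slides.

```python
# Sketch of the perceptron learning rule w <- w + eta * (d - y) * x,
# trained on AND, which is linearly separable. x[0] = 1 is a constant
# bias input, so w[0] plays the role of -t.
def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train(samples, eta=0.1, epochs=100):
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, d in samples:
            e = d - predict(w, x)                            # e(n) = d(n) - y(n)
            w = [wi + eta * e * xi for wi, xi in zip(w, x)]  # no-op when e = 0
    return w

AND_SAMPLES = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train(AND_SAMPLES)   # predict(w, x) now matches d for every pair
```

The single e(n) form combines both misclassification cases above: e = +1 adds x, e = -1 subtracts it, and e = 0 leaves the weights unchanged.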
Learning in Perceptron: Another Look

[Figure: weight vector w, a misclassified input x, and the updated vectors w + x and w - x.] When a positive example (C_1) is misclassified, w(n+1) = w(n) + η(n) x(n). When a negative example (C_2) is misclassified, w(n+1) = w(n) - η(n) x(n). Note the tilt in the weight vector, and observe how it would change the decision boundary.

Perceptron Convergence Theorem

Given a set of linearly separable inputs, and without loss of generality, assume η = 1 and w(1) = 0. Assume the first n examples from C_1 are all misclassified. Then, using w(n+1) = w(n) + x(n), we get

  w(n+1) = x(1) + x(2) + ... + x(n).    (1)

Since the input set is linearly separable, there is at least one solution w* such that w*^T x(n) > 0 for all inputs in C_1. Define

  α = min_{x(n) ∈ C_1} w*^T x(n) > 0.

Multiplying both sides of eq. 1 by w*^T, we get:

  w*^T w(n+1) = w*^T x(1) + w*^T x(2) + ... + w*^T x(n).    (2)

From the two steps above, we get:

  w*^T w(n+1) ≥ nα.    (3)

Perceptron Convergence Theorem (cont'd)

Using the Cauchy-Schwarz inequality,

  ||w*||² ||w(n+1)||² ≥ [w*^T w(n+1)]².

From the above and w*^T w(n+1) ≥ nα, we get ||w*||² ||w(n+1)||² ≥ n²α². So, finally, we get the first main result:

  ||w(n+1)||² ≥ n²α² / ||w*||².    (4)

Perceptron Convergence Theorem (cont'd)

Taking the squared Euclidean norm of w(k+1) = w(k) + x(k),

  ||w(k+1)||² = ||w(k)||² + 2 w^T(k) x(k) + ||x(k)||².

Since all n inputs in C_1 are misclassified, w^T(k) x(k) ≤ 0 for k = 1, 2, ..., n, so

  ||w(k+1)||² ≤ ||w(k)||² + ||x(k)||²,  i.e.,  ||w(k+1)||² - ||w(k)||² ≤ ||x(k)||².

Summing up the inequalities for all k = 1, 2, ..., n, with w(1) = 0, we get

  ||w(n+1)||² ≤ Σ_{k=1}^{n} ||x(k)||² ≤ nβ,    (5)

where β = max_{x(k) ∈ C_1} ||x(k)||².
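The setup above (η = 1, w(1) = 0) can be run numerically on an assumed toy separable set, using the two-class update rule from the learning-rule slides. The particular solution w* = [-3, 2, 2] used to compute α and β is an assumption for this example; the update count stays below β ||w*||² / α², consistent with the bound that follows from eqs. (4) and (5).

```python
# Perceptron updates with eta = 1 and w(1) = 0 on an assumed toy
# separable set; inputs carry a constant bias component x[0] = 1.
pos = [[1, 1, 1]]                          # class C1: want w^T x > 0
neg = [[1, 0, 0], [1, 0, 1], [1, 1, 0]]    # class C2: want w^T x <= 0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

w, updates = [0, 0, 0], 0
changed = True
while changed:                             # loop until a full error-free pass
    changed = False
    for x in pos:
        if dot(w, x) <= 0:                 # misclassified C1 example: add x
            w = [wi + xi for wi, xi in zip(w, x)]; updates += 1; changed = True
    for x in neg:
        if dot(w, x) > 0:                  # misclassified C2 example: subtract x
            w = [wi - xi for wi, xi in zip(w, x)]; updates += 1; changed = True

# For the assumed solution w* = [-3, 2, 2]: alpha = 1, beta = 3, and
# beta * ||w*||^2 / alpha^2 = 3 * 17 / 1 = 51.
n_max = 51
```

The loop halts after a handful of passes, and the final w separates the two classes, which is exactly the finite-convergence claim being proved here.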
Perceptron Convergence Theorem (cont'd)

From eq. 4 and eq. 5,

  n²α² / ||w*||² ≤ ||w(n+1)||² ≤ nβ.

Here, α is a constant, depending on the fixed input set and the fixed solution w* (so ||w*|| is also a constant), and β is also a constant since it depends only on the fixed input set. In this case, if n grows to a large value, the above inequality becomes invalid (n is a positive integer). Thus, n cannot grow beyond a certain n_max, where

  n_max² α² / ||w*||² = n_max β,  so  n_max = β ||w*||² / α²,

and when n = n_max, all inputs will be correctly classified.

Fixed-Increment Convergence Theorem

Let the subsets of training vectors C_1 and C_2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n₀ iterations, in the sense that

  w(n₀) = w(n₀ + 1) = w(n₀ + 2) = ...

is a solution vector, for n₀ ≤ n_max.

Summary

- The adaptive filter using the LMS algorithm and the perceptron are closely related (the learning rules are almost identical).
- LMS and the perceptron are different, however, since one uses a linear activation function and the other a hard limiter.
- LMS is used in continuous learning, while perceptrons are trained for only a finite number of steps.
- A single neuron (or a single layer) has severe limits: how can multiple layers help?
XOR with Multilayer Perceptrons

[Figure: XOR expressed with AND/OR gates, and the corresponding two-layer network.] Note: the bias units are not shown in the network on the right, but they are needed. Only three perceptron units are needed to implement XOR. However, you need two layers to achieve this.
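The three-unit, two-layer construction can be written out with the same threshold units as the Boolean-gate slides; expressing XOR as OR(a, b) AND NOT AND(a, b) is one standard choice of hidden features (the exact wiring in the slide's figure may differ).

```python
# Two-layer XOR from three threshold units: two hidden units (OR, AND)
# and one output unit computing h1 AND (NOT h2).
def unit(ws, theta, xs):
    return 1 if sum(w * x for w, x in zip(ws, xs)) > theta else 0

def xor(a, b):
    h1 = unit([1, 1], 0.5, [a, b])        # OR(a, b)
    h2 = unit([1, 1], 1.5, [a, b])        # AND(a, b)
    return unit([1, -1], 0.5, [h1, h2])   # h1 AND NOT h2

# xor(0, 0) -> 0, xor(0, 1) -> 1, xor(1, 0) -> 1, xor(1, 1) -> 0
```

The hidden layer remaps the four input points so that the classes become linearly separable for the output unit, which is exactly what a single layer could not do.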
More informationGeorey E. Hinton. University oftoronto. Technical Report CRG-TR February 22, Abstract
Parameer Esimaion for Linear Dynamical Sysems Zoubin Ghahramani Georey E. Hinon Deparmen of Compuer Science Universiy oftorono 6 King's College Road Torono, Canada M5S A4 Email: zoubin@cs.orono.edu Technical
More informationSolutions from Chapter 9.1 and 9.2
Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is
More informationRobust estimation based on the first- and third-moment restrictions of the power transformation model
h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,
More information1 Solutions to selected problems
1 Soluions o seleced problems 1. Le A B R n. Show ha in A in B bu in general bd A bd B. Soluion. Le x in A. Then here is ɛ > 0 such ha B ɛ (x) A B. This shows x in B. If A = [0, 1] and B = [0, 2], hen
More informationOnline Appendix to Solution Methods for Models with Rare Disasters
Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,
More informationMathcad Lecture #8 In-class Worksheet Curve Fitting and Interpolation
Mahcad Lecure #8 In-class Workshee Curve Fiing and Inerpolaion A he end of his lecure, you will be able o: explain he difference beween curve fiing and inerpolaion decide wheher curve fiing or inerpolaion
More informationEXERCISES FOR SECTION 1.5
1.5 Exisence and Uniqueness of Soluions 43 20. 1 v c 21. 1 v c 1 2 4 6 8 10 1 2 2 4 6 8 10 Graph of approximae soluion obained using Euler s mehod wih = 0.1. Graph of approximae soluion obained using Euler
More informationClass Meeting # 10: Introduction to the Wave Equation
MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion
More informationMulti-scale 2D acoustic full waveform inversion with high frequency impulsive source
Muli-scale D acousic full waveform inversion wih high frequency impulsive source Vladimir N Zubov*, Universiy of Calgary, Calgary AB vzubov@ucalgaryca and Michael P Lamoureux, Universiy of Calgary, Calgary
More informationEcon107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)
I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression
More informationAppendix to Online l 1 -Dictionary Learning with Application to Novel Document Detection
Appendix o Online l -Dicionary Learning wih Applicaion o Novel Documen Deecion Shiva Prasad Kasiviswanahan Huahua Wang Arindam Banerjee Prem Melville A Background abou ADMM In his secion, we give a brief
More informationConcourse Math Spring 2012 Worked Examples: Matrix Methods for Solving Systems of 1st Order Linear Differential Equations
Concourse Mah 80 Spring 0 Worked Examples: Marix Mehods for Solving Sysems of s Order Linear Differenial Equaions The Main Idea: Given a sysem of s order linear differenial equaions d x d Ax wih iniial
More informationDiebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles
Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance
More informationSolutions Problem Set 3 Macro II (14.452)
Soluions Problem Se 3 Macro II (14.452) Francisco A. Gallego 04/27/2005 1 Q heory of invesmen in coninuous ime and no uncerainy Consider he in nie horizon model of a rm facing adjusmen coss o invesmen.
More informationLecture Notes 2. The Hilbert Space Approach to Time Series
Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship
More informationHOMEWORK # 2: MATH 211, SPRING Note: This is the last solution set where I will describe the MATLAB I used to make my pictures.
HOMEWORK # 2: MATH 2, SPRING 25 TJ HITCHMAN Noe: This is he las soluion se where I will describe he MATLAB I used o make my picures.. Exercises from he ex.. Chaper 2.. Problem 6. We are o show ha y() =
More information5. Stochastic processes (1)
Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly
More information= ( ) ) or a system of differential equations with continuous parametrization (T = R
XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of
More informationCE 395 Special Topics in Machine Learning
CE 395 Special Topics in Machine Learning Assoc. Prof. Dr. Yuriy Mishchenko Fall 2017 DIGITAL FILTERS AND FILTERING Why filers? Digial filering is he workhorse of digial signal processing Filering is a
More information( ) a system of differential equations with continuous parametrization ( T = R + These look like, respectively:
XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of
More informationPMR5406 Redes Neurais e Lógica Fuzzy Aula 3 Single Layer Percetron
PMR5406 Redes Neurais e Aula 3 Single Layer Percetron Baseado em: Neural Networks, Simon Haykin, Prentice-Hall, 2 nd edition Slides do curso por Elena Marchiori, Vrije Unviersity Architecture We consider
More informationWEEK-3 Recitation PHYS 131. of the projectile s velocity remains constant throughout the motion, since the acceleration a x
WEEK-3 Reciaion PHYS 131 Ch. 3: FOC 1, 3, 4, 6, 14. Problems 9, 37, 41 & 71 and Ch. 4: FOC 1, 3, 5, 8. Problems 3, 5 & 16. Feb 8, 018 Ch. 3: FOC 1, 3, 4, 6, 14. 1. (a) The horizonal componen of he projecile
More informationPhysics 180A Fall 2008 Test points. Provide the best answer to the following questions and problems. Watch your sig figs.
Physics 180A Fall 2008 Tes 1-120 poins Name Provide he bes answer o he following quesions and problems. Wach your sig figs. 1) The number of meaningful digis in a number is called he number of. When numbers
More informationExplaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015
Explaining Toal Facor Produciviy Ulrich Kohli Universiy of Geneva December 2015 Needed: A Theory of Toal Facor Produciviy Edward C. Presco (1998) 2 1. Inroducion Toal Facor Produciviy (TFP) has become
More informationIB Physics Kinematics Worksheet
IB Physics Kinemaics Workshee Wrie full soluions and noes for muliple choice answers. Do no use a calculaor for muliple choice answers. 1. Which of he following is a correc definiion of average acceleraion?
More informationAnnouncements: Warm-up Exercise:
Fri Apr 13 7.1 Sysems of differenial equaions - o model muli-componen sysems via comparmenal analysis hp//en.wikipedia.org/wiki/muli-comparmen_model Announcemens Warm-up Exercise Here's a relaively simple
More informationEE 315 Notes. Gürdal Arslan CLASS 1. (Sections ) What is a signal?
EE 35 Noes Gürdal Arslan CLASS (Secions.-.2) Wha is a signal? In his class, a signal is some funcion of ime and i represens how some physical quaniy changes over some window of ime. Examples: velociy of
More information