Slide04 (supplemental) Haykin Chapter 4 (both 2nd and 3rd ed): Multi-Layer Perceptrons

Size: px

Start display at page:

Download "Slide04 (supplemental) Haykin Chapter 4 (both 2nd and 3rd ed): Multi-Layer Perceptrons"

Arnold Burns
6 years ago
Views:

1 Slide04 supplemental) Haykin Chapter 4 bth 2nd and 3rd ed): Multi-Layer Perceptrns CPSC Instructr: Ynsuck Che Heuristic fr Making Backprp Perfrm Better 1. Sequential vs. batch update: fr large and highly redundant data sets, sequential update wrks well. 2. Maximizatin f infrmatin cntent: Every training sample shuld be chsen t maximize infrmatin cntent we want t search mre f the weight space). Use examples that result in the largest errr. Use examples that are radically different frm thse previusly used. Randmize shuffle) input presentatin rder. Use an emphasizing scheme: present mre difficult inputs. Issues: input distributin is distrted, and utlier r mislabeled input can cause serius prblems. 1 2 Heuristic fr Backprp cnt d) Heuristic fr Backprp cnt d) Mean remval Decrrelatin Cvariance Equalizatin 3. Activatin functin: Usually learning is faster with antisymmetric activatin functins. φ v) = φv). A ppular example is φv) = a tanhbv). Nte: φ v) = ab1 tanh 2 bv)). 4. Target values: target values shuld be within the range f the sigmid activatin functin. 5. Nrmalizing the inputs: preprcessing t make certain statistical prperties hld ver the entire training set is imprtant. 6. Initializatin: Small initial weight values are usually gd, since they prevent saturatin f activity, but it is nt gd either t have t small weights. One suggestin is t initialize the weights t randm values frm a unifrmly randm distributin with zer mean and standard deviatin f σ w = m 1/2, where m is the number f cnnectins fr that neurn. 7. Learning frm hints: Include prir infrmatin abut the functin t be learned f ), e.g., invariance prperties, symmetries, etc. 8. Learning rates: hidden layer shuld have higher learning rate that utput layer utput layer tends t have larger lcal gradients). Neurns with many inputs shuld have smaller learning rates. 3 4

2 Weight Initializatin Weight Initializatin cnt d) Input: µ y = E[y i ] = 0, σ 2 y = E[y i µ y ) 2 ] = E[y 2 ] = 1. Als assume E[y i y k ] = 1 if k = i and 0 if k i inputs are uncrrelated). Weights: µ w = E[w ji ] = 0, and σ w = E[w ji µ w ) 2 ] = E[w 2 ji ]. Given an input distributin zer mean, unit variance), what is the distributin f the induced lcal field the weighted sum) that gets plugged int the sigmid φ )? We want this value v t be within the ramp regin f the sigmid. Fr this, what shuld the weight distributin lk like? 5 Induced lcal field: [ m ] µ v = E[v j ] = E w ji y i = E[w ji ]E[y i ] = 0. σ 2 v = E [ v j µ v ) 2] = E[v 2 j ] = E [ m = k=1 E[w ji w jk ]E[y i y k ] = k=1 ] w ji w jk y i y k E[w 2 ji ] = mσ2 w. We want t adjust σ v = m 1/2 σ w t fit within the the ramp f φ ). 6 Supervised Learning as an Optimizatin Prblem Newtn s Methds vs. Grad. Descent Supervised training f MLP as a prblem f numerical ptimizatin f E av w) averaged ver all input samples. Taking the Taylr series expansin: E av wn) + wn)) = E av wn)) + g T n) wn) wt n)hn) wn) + HOT, where gn) = E avw) w and Hn) = 2 E av w) w=wn) w 2 First-rder methd: gradient descent linear apprx) wn) = ηgn) Secnd-rder methd: Newtn s methd quadratic apprx) a w n) = H 1 n)gn) a Nte: typ in this eq in the bk need t add - ) 7 w=wn) surce: wikipedia Gradient descent green): linear search Newtn s methd red): quadratic search. 8

3 Cnjugate Gradient Methd: Intrductin Newtn s methd, using the inverse f the Hessian H 1 n), has limitatins: Cmputatin f the Hessian can be expensive, The Hessian needs t be nnsingular fr the inverse), which is rarely the case. When the cst functin is nn-quadratic, cnvergence is nt guaranteed. Cnjugate gradient methd vercmes the abve prblems while accelerating cnvergence it s a secnd-rder methd). Quadratic Functin Minimizatin Cnsider minimizing the quadratic functin A is sym. ps. def.): which is minimized fr fx) = 1 2 xt Ax b T x + c x = A 1 b. S, the minimizatin prblem turns int slving a system f linear equatins: Ax = b. Cnjugate vectrs: Given an matrix A, nnzer vectrs s0), s1),...sw 1) is A-cnjugate if s T n)asj) = 0 fr all n j. When A = I, the meaning is clear: The vectrs are rthgnal Cnjugate Vectrs: Prperties The square rt f the matrix A is defined as: A = A 1/2 A 1/2 existence and uniqueness is guaranteed fr ps. semidefinite matrices). Since A is symmetric psitive definite, A T = A 1/2 A 1/2) T = A 1/2 ) T A 1/2 ) T Thus, we get A 1/2 = A 1/2 ) T. Given any cnjugate vectrs x i and x j, x i Ax j = x T i A 1/2 A 1/2) x j = x T i = x T i A1/2 ) T A 1/2 x j = S, transfrming any nnzer) vectr x i int: v i = A 1/2 x i results in vectrs v i that are mutually rthgnal. 11 A 1/2 ) T A 1/2) x j A 1/2 x i ) T A 1/2 x j = 0. Cnjugate Vectrs: Prperties Lin. Indep.) The cnjugate vectrs are linearly independent, s they can frm a basis t span the vectr space. Prf: Assume they are nt linearly independent: s0) = W 1 j=1 Multiplying bth sides by As0), we get: s T 0)As0) = W 1 j=1 α j sj). α j s T j)as0). But, this is 0, since s T 0)As0) = 0 since these are cnjugate vectrs. Hwever, A is psitive definite and the vectrs are nnzer, the sum cannt be 0, a cntradictin. Thus, the cnj. vectrs are nt linearly dependent. 12

4 Cnjugate Directin Methd Fr a quadratic errr functin fx) xn + 1) = xn) + ηn)sn), n = 0, 1,..., W 1, where x0) is an arbitrary initial vectr and ηn) is defined by fxn) + ηn)sn)) = min fxn) + ηsn)). η The part where ηn) is picked is called a line search. S, in sum, bth the directin and the distance are determined. 13 Observatins Cnjugate Directin Methd cnt d) Since the cnjugate vectrs are linearly independent, they frm a basis that spans the vectr space f x. The line search results in the fllwing frmula fr η: ηn) = st n)aɛn) s T n)asn), where ɛn) = xn) x is the errr vectr. But x is unknwn! Ax x ) = Ax b) Ax b) = Ax b). }{{} This is 0. Starting frm an arbitrary x0), the methd reaches the minimum in at mst W iteratins. 14 Cnjugate Directin Methd cnt d) Fr each iteratin n, xn + 1) minimizes fx) ver a linear vectr space D n that passes thrugh an arbitrary pint x0), and is spanned by the A-cnjugate vectrs: where xn + 1) = arg min fx), x D n D n = xn) xn) = x0) + n ηj)sj). j=0 The whle scheme depends n the existence f the cnjugate vectrs sn): Slutin cnjugate gradient methd. 15 Cnjugate Gradient Methd Determine successive cnjugate vectrs {sn)} in a sequential manner at successive steps. Residual: rn) = b Axn) nte: this is f/ x). Recursive step: Find the next step sn) as sn) = rn) + βn)sn 1), n = 1, 2,..., W 1, where the scaling factr βn) is determined as βn) = st n 1)Arn) s T n 1)Asn). This results in cnjugate vectrs sn)! Prblem: we need t knw A, with respect t which the vectrs are cnjugate. 16

5 Cnjugate Gradient Methds: Alternative βn)s Cnjugate Gradient Methd: Estimating ηn) We can evaluate the scaling factr βn) withut an explicit knwledge f the matrix A, based nly n the residuals: Plak-Ribiére frmula Fletcher-Reeves frmula βn) = rt n)rn) rn 1)) r T. n 1)rn 1) r T n)rn) βn) = r T n 1)rn 1). Plak-Ribiére frm is superir fr nnquadratic cst functins, and is guaranteed t cnverge if β = maxβ PR, 0). Line search prcedure estimates the prper ηn): Bracketing phase: find a nntrivial interval that is knwn t cntain the minimum. Sectining phase: divide the bracket int sectins, leading t successively narrwer brackets. These tw steps abve can be achieve by inverse parablic apprximatin. Repeated applicatin f the tw phases results in gd estimatin f the pint f minimum Cnjugate Gradient Methd vs. Gradient Descent Applicatin f CGM t NN Learning Apprximate the cst functin E av w) by a quadratic functin ignre HOT). Gradient descent can scillate badly. Cnjugate gradient methd can directly bring yu t the minimum in ne step plus the initial step) if the cst functin is quadratic fr a 2D case). Figure frm Gutierrez-Osuna s 636 ntes. Frmulate the cmputatin f cefficients βn) and ηn) s as t nly require gradient infrmatin: avid calculating the Hessian. Quadratic functin fx) Parameter vectr xn) Gradient vectr fx) x Matrix A Cst functin E av w) Weight vectr wn) Gradient vectr g = E av w Hessian matrix H 19 20

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines

COMP 551 Applied Machine Learning Lecture 11: Supprt Vectr Machines Instructr: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/cmp551 Unless therwise nted, all material psted fr this curse