CS 189 Fall 2017 Introduction to Machine Learning DIS10

1 Fun with Lagrange Multipliers

(a) Minimize the function $f(x, y) = x + 2y$ such that $x^2 + y^2 = 3$.

Solution: The Lagrangian is:

    $L(x, y, \lambda) = x + 2y + \lambda(x^2 + y^2 - 3)$

Taking all of the partial derivatives and setting them to 0, we get this system of equations:

    $2\lambda x = -1$
    $2\lambda y = -2$
    $x^2 + y^2 = 3$

We can infer that $y = 2x$. Plugging this into the constraint, we have:

    $x^2 + 4x^2 = 3,$

which shows that $x = \pm\sqrt{3/5}$. We have two critical points, $(-\sqrt{3/5}, -2\sqrt{3/5})$ and $(\sqrt{3/5}, 2\sqrt{3/5})$. Plugging these into our objective function $f$, we find that the minimizer is the former, with a value of $-5\sqrt{3/5} = -\sqrt{15}$.

(b) Minimize the function $f(x, y, z) = x^2 - y^2$ such that $x^2 + 2y^2 + 3z^2 = 1$.

Solution: The Lagrangian is:

    $x^2 - y^2 + \lambda(x^2 + 2y^2 + 3z^2 - 1)$

Taking all of the partial derivatives and setting them to 0, we get this system of equations:

    $x = -\lambda x$
    $y = 2\lambda y$
    $0 = \lambda z$
    $x^2 + 2y^2 + 3z^2 = 1$
To solve this, we look at several cases:

Case 1: $\lambda = 0$. This implies that $x = y = 0$, and $z = \pm\sqrt{1/3}$. We have two critical points: $(0, 0, \pm\sqrt{1/3})$.

Case 2: $\lambda \neq 0$. Then $z$ must be 0.

Case 2a: $x = 0$. The constraint gives us that $y = \pm 1/\sqrt{2}$. This gives us another two critical points: $(0, \pm 1/\sqrt{2}, 0)$.

Case 2b: $y = 0$. The constraint gives us $x = \pm 1$, giving us another two critical points: $(\pm 1, 0, 0)$.

Plugging in all of our critical points, we find that $(0, \pm 1/\sqrt{2}, 0)$ minimizes our function with a value of $-1/2$.

2 Support Vector Machines

(a) We typically frame an SVM problem as trying to maximize the margin. Explain intuitively why a bigger margin will result in a model that will generalize better, or perform better in practice.

Solution: One intuition is that if points are closer to the boundary, we are less certain about their class. Thus, it would make sense to create a boundary where our certainty is highest about all the training set points.

Another intuition involves thinking about the process that generated the data we are working with. Since it's a noisy process, if we drew a boundary close to one of our training points of some class, it's very possible that a point of the same class will be generated across the boundary, resulting in an incorrect classification. Therefore it makes sense to make the boundary as far away from our training points as possible.

(b) Will moving points which are not support vectors further away from the decision boundary affect the SVM's hinge loss?

Solution: No. The hinge loss is defined as $\frac{1}{N}\sum_{i=1}^{N} \max(0, 1 - y_i(w^\top x_i + b))$. For non-support vectors, the right-hand term inside the max is already negative, and moving the point further away from the boundary will only make it more negative. The max will return zero regardless. This means that the loss function, and the consequent decision boundary, is entirely determined by the support vectors and nothing else.

(c) Show that the width of an SVM slab with linearly separable data is $\frac{2}{\|w\|}$.

Solution: The width of the margin is defined by the points that lie on it, also called support vectors.
Let's say we have a point, $x$, which is a support vector. The distance between $x$ and the separating hyperplane can be calculated by projecting the vector starting at a point $x_0$ on the plane and ending at $x$ onto the plane's unit normal vector. The equation of the plane is $w^\top x_0 + b = 0$.
Since $w$ by definition is orthogonal to the hyperplane, we want to project $x - x_0$ onto the unit vector normal to the hyperplane, $\frac{w}{\|w\|}$:

    $\frac{w^\top}{\|w\|}(x - x_0) = \frac{1}{\|w\|}(w^\top x - w^\top x_0) = \frac{1}{\|w\|}(w^\top x + b - w^\top x_0 - b)$

Since we set $w^\top x + b = 1$ (or $-1$) and, by definition, $w^\top x_0 + b = 0$, this quantity just turns into $\frac{1}{\|w\|}$ or $-\frac{1}{\|w\|}$, so the distance is the absolute value, $\frac{1}{\|w\|}$. Since this margin is half of the slab, we double it to get the full width of $\frac{2}{\|w\|}$.

(d) You are presented with the following set of data (triangle = +1, circle = -1): [figure omitted] Find the equation (by hand) of the hyperplane $w^\top x + b = 0$ that would be used by an SVM classifier. Which points are support vectors?

Solution: The hyperplane will pass through the point $(2, 1)$ with a slope of $-1$. The equation of this line is $x_1 + x_2 = 3$. From this form, we know that $w_1 = w_2$. We also know that at the support vectors, $w^\top x + b = \pm 1$. This gives us the equations:

    $1 w_1 + 0 w_2 + b = -1$
    $3 w_1 + 2 w_2 + b = 1$

Solving this system of equations, we get $w = [\frac{1}{2}, \frac{1}{2}]^\top$ and $b = -\frac{3}{2}$. The support vectors are $(1, 0)$, $(0, 1)$, and $(3, 2)$.

3 Simple SGD updates

Let us consider a simple least squares problem, where we are interested in optimizing the function

    $F(w) = \frac{1}{2n}\|Aw - y\|_2^2 = \frac{1}{2n}\sum_{i=1}^{n}(a_i^\top w - y_i)^2.$

(a) What is the closed-form OLS solution? What is the time complexity of computing this solution in terms of flops?
Solution: The closed-form solution is

    $\hat{w} = (A^\top A)^{-1} A^\top y.$

This takes time $nd^2 + nd + d^3$ to compute: $nd^2$ to find $A^\top A$, since it takes $n$ multiplications to compute each entry of this $d \times d$ matrix; $nd$ to find $A^\top y$, since it takes $n$ multiplications to compute each entry of this $d$-vector; and $d^3$ time to invert a $d \times d$ matrix via Gaussian elimination.

(b) Write down the gradient descent update. What is the time complexity of computing an $\epsilon$-optimal solution?

Solution: For gradient descent, we have the update

    $w_{t+1} = w_t - \frac{\gamma}{n} A^\top (A w_t - y).$

We know from HW that, denoting $e_k = w_k - w^*$ and letting $Q$ denote the condition number of $A^\top A$, we have $\|e_k\| \le \left(\frac{Q-1}{Q+1}\right)\|e_{k-1}\|$. We therefore obtain geometric convergence to the optimum, and the number of iterations is roughly $T \approx Q \log(1/\epsilon)$ to converge to within $\epsilon$ of the optimum (write this out to see why, using the approximation $1 - x \approx e^{-x}$). Also note that during each iteration we perform work $nd$, since $A w_t$ takes $nd$ time to compute, and performing the multiplication $A^\top(A w_t - y)$ takes $nd$ time as well. So the total cost is $nd \log(1/\epsilon)$, treating $Q$ as a constant.

(c) Write down the stochastic gradient descent update. What is the time complexity of computing an $\epsilon$-optimal solution? You may want to quickly go through a derivation here. What happens when $Aw = y$? Discuss why you would use any of these methods for your problem.

Solution: Let us derive the SGD convergence rate from first principles. We have the update equation

    $w_{t+1} = w_t - \gamma\, a_J (a_J^\top w_t - y_J),$

where $J$ is chosen uniformly at random from the set $\{1, 2, 3, \dots, n\}$. Notice that this makes sense as a noisy gradient, since

    $E_J[a_J (a_J^\top w_t - y_J)] = E_J[\nabla f_J(w_t)] = \frac{1}{n}(A^\top A w_t - A^\top y) = \nabla F(w_t),$

and so the gradient estimate is unbiased. Let us now compute the convergence rate. We have

    $\|w_{t+1} - w^*\|^2 = \|w_t - w^*\|^2 + \gamma^2 \|\nabla f_J(w_t)\|^2 - 2\gamma (w_t - w^*)^\top \nabla f_J(w_t).$

Now notice that there are two sources of randomness on the RHS. The iterate $w_t$ is itself random, since we have chosen random indices up until that point. The index $J$ is also random. Crucially, these two sources of randomness are independent of each other.
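The claim that the stochastic gradient is unbiased can be checked numerically; here is a minimal sketch with numpy on synthetic data (shapes and names are illustrative, not from the worksheet):

```python
import numpy as np

# Sanity check: the per-sample gradient a_J (a_J^T w - y_J), averaged
# over all indices J, should equal the full gradient
# (1/n) (A^T A w - A^T y) of F(w) = (1/(2n)) ||Aw - y||^2.
rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

# Average of the n stochastic gradients (uniform over J).
avg_stochastic = np.mean(
    [A[j] * (A[j] @ w - y[j]) for j in range(n)], axis=0
)

# Full gradient computed directly from the matrix form.
full_gradient = (A.T @ (A @ w - y)) / n

assert np.allclose(avg_stochastic, full_gradient)
```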
In particular, we may now take the inner product term and compute the expectation over the index
$J$ to obtain

    $E_J\left[2\gamma (w_t - w^*)^\top \nabla f_J(w_t)\right] = 2\gamma (w_t - w^*)^\top E_J[\nabla f_J(w_t)] = 2\gamma (w_t - w^*)^\top \nabla F(w_t) = \frac{2\gamma}{n} (w_t - w^*)^\top A^\top A (w_t - w^*) = \frac{2\gamma}{n} \|A(w_t - w^*)\|^2 \ge \frac{2\gamma}{n} \lambda_{\min}(A^\top A) \|w_t - w^*\|^2,$

where in the last step we have used a simple eigenvalue bound; go back and look at HW6 if this is not clear. Letting $m = \lambda_{\min}(A^\top A)/n$, we have

    $E_J\left[\|w_{t+1} - w^*\|^2\right] \le (1 - 2\gamma m)\|w_t - w^*\|^2 + \gamma^2 E_J\|\nabla f_J(w_t)\|^2 \le (1 - \gamma m)\|w_t - w^*\|^2 + \gamma^2 E_J\|\nabla f_J(w_t)\|^2.$

We will now make some additional assumptions. First, we assume that $\|a_i\| = 1$. Next, we assume that we will always stay within a region such that the function $F(w) \le M$ (note that we can do this by evaluating the loss and ensuring that we don't take a step if this condition is violated, or by projection). Consequently, we have

    $E_J\|\nabla f_J(w_t)\|^2 = \frac{1}{n}\sum_{i=1}^n \|a_i (a_i^\top w_t - y_i)\|^2 = \frac{1}{n}\sum_{i=1}^n \|a_i\|^2 (a_i^\top w_t - y_i)^2 = 2 F(w_t) \le 2M.$

We are now in a position to complete the analysis. We have

    $E\left[\|w_{t+1} - w^*\|^2\right] \le (1 - \gamma m)\, E\left[\|w_t - w^*\|^2\right] + 2\gamma^2 M,$

where we have taken an additional expectation with respect to the randomness up to and including time $t$. Rolling this out (think of induction in reverse), we have

    $E\left[\|w_{t+1} - w^*\|^2\right] \le (1 - \gamma m)^2\, E\left[\|w_{t-1} - w^*\|^2\right] + 2\gamma^2 M (1 - \gamma m) + 2\gamma^2 M.$

Do you spot the pattern? We effectively have

    $E\left[\|w_t - w^*\|^2\right] \le (1 - \gamma m)^t\, E\left[\|w_0 - w^*\|^2\right] + 2\gamma^2 M \sum_{i=0}^{t-1} (1 - \gamma m)^i \le (1 - \gamma m)^t\, E\left[\|w_0 - w^*\|^2\right] + 2\gamma^2 M \sum_{i=0}^{\infty} (1 - \gamma m)^i = (1 - \gamma m)^t\, E\left[\|w_0 - w^*\|^2\right] + \frac{2\gamma^2 M}{\gamma m} = (1 - \gamma m)^t\, E\left[\|w_0 - w^*\|^2\right] + \frac{2\gamma M}{m}.$
Now if we want the LHS to be less than $\epsilon$, it suffices to make each of the above terms less than $\epsilon/2$. In particular, we have the relations $\frac{2\gamma M}{m} \le \epsilon/2$ and $(1 - \gamma m)^t\, E\left[\|w_0 - w^*\|^2\right] \le \epsilon/2$. Doing some algebra, we are led to the choices

    $\gamma = \frac{\epsilon m}{4M}, \quad \text{and} \quad t = \frac{1}{\gamma m}\log(2 D_0/\epsilon) = \frac{4M}{\epsilon m^2}\log(2 D_0/\epsilon),$

where $D_0$ denotes our initial squared distance to the optimum, $E\left[\|w_0 - w^*\|^2\right]$.

In effect, we converge in $\epsilon^{-1}\log(1/\epsilon)$ iterations, and each iteration takes $O(d)$ time (why?).

Let us now compare all three algorithms. Clearly, GD beats OLS provided $nd\log(1/\epsilon) < nd^2$, which happens when $d > \log(1/\epsilon)$. Think about what this means! Setting $\epsilon = 10^{-6}$ (essentially at the optimum), we see that GD wins for any problem in which $d > 20$!

Comparing SGD and GD, the quantities are $nd\log(1/\epsilon)$ versus $\frac{d}{\epsilon}\log(1/\epsilon)$. In other words, SGD provides gains in convergence when $n \gtrsim 1/\epsilon$, i.e., when we have sufficiently many samples. There are also other advantages to SGD that this analysis doesn't quite illustrate; for instance, scalability and generalization ability.

Comparing SGD and OLS, we see that SGD wins when $nd^2 > \frac{d}{\epsilon}\log(1/\epsilon)$, and so the relevant comparison is between $nd$ and $\frac{1}{\epsilon}\log(1/\epsilon)$. SGD again wins for moderately sized problems.

(d) Write down the SGD update for logistic regression on two classes,

    $F(w) = \frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log\frac{1}{\sigma(w^\top x_i)} + (1 - y_i)\log\frac{1}{1 - \sigma(w^\top x_i)} \right].$

Discuss why this is equivalent to minimizing a cross-entropy loss.
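For concreteness, differentiating the loss for a single sample gives $\nabla f_J(w) = (\sigma(w^\top x_J) - y_J)\,x_J$, so the SGD step is $w_{t+1} = w_t - \gamma(\sigma(w_t^\top x_J) - y_J)\,x_J$. A minimal sketch on synthetic data (all names and data here are illustrative, not from the worksheet):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic linearly separable data with labels in {0, 1},
# generated by a hypothetical "true" weight vector.
rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0])
y = (X @ w_true > 0).astype(float)

w = np.zeros(d)
gamma = 0.1
for t in range(2000):
    J = rng.integers(n)                        # J uniform over {0, ..., n-1}
    grad = (sigmoid(X[J] @ w) - y[J]) * X[J]   # per-sample gradient
    w -= gamma * grad

# Training accuracy of the learned classifier.
accuracy = np.mean((X @ w > 0) == (y == 1))
```

Note that this is the same update shape as in the least-squares case, with $a_J^\top w$ replaced by $\sigma(w^\top x_J)$.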