CS 229, Public Course Problem Set #3 Solutions: Learning Theory and Unsupervised Learning
1. Uniform convergence and Model Selection

In this problem, we will prove a bound on the error of a simple model selection procedure. Let there be a binary classification problem with labels $y \in \{0,1\}$, and let $H_1 \subseteq H_2 \subseteq \cdots \subseteq H_k$ be $k$ different finite hypothesis classes ($|H_i| < \infty$). Given a dataset $S$ of $m$ i.i.d. training examples, we will divide it into a training set $S_{\mathrm{train}}$ consisting of the first $(1-\beta)m$ examples, and a hold-out cross-validation set $S_{\mathrm{cv}}$ consisting of the remaining $\beta m$ examples. Here, $\beta \in (0,1)$.

Let $\hat h_i = \arg\min_{h \in H_i} \hat\varepsilon_{S_{\mathrm{train}}}(h)$ be the hypothesis in $H_i$ with the lowest training error (on $S_{\mathrm{train}}$). Thus, $\hat h_i$ would be the hypothesis returned by training (with empirical risk minimization) using hypothesis class $H_i$ and dataset $S_{\mathrm{train}}$. Also let $h_i^\star = \arg\min_{h \in H_i} \varepsilon(h)$ be the hypothesis in $H_i$ with the lowest generalization error.

Suppose that our algorithm first finds all the $\hat h_i$'s using empirical risk minimization, then uses the hold-out cross-validation set to select the hypothesis from the set $\{\hat h_1, \dots, \hat h_k\}$ with minimum cross-validation error. That is, the algorithm will output

$$\hat h = \arg\min_{h \in \{\hat h_1, \dots, \hat h_k\}} \hat\varepsilon_{S_{\mathrm{cv}}}(h).$$

For this question you will prove the following bound. Let any $\delta > 0$ be fixed. Then with probability at least $1 - \delta$, we have that

$$\varepsilon(\hat h) \le \min_{i=1,\dots,k}\left( \varepsilon(h_i^\star) + \sqrt{\frac{2}{(1-\beta)m}\log\frac{4|H_i|}{\delta}} \right) + \sqrt{\frac{2}{\beta m}\log\frac{4k}{\delta}}.$$

(a) Prove that with probability at least $1 - \delta/2$, for all $\hat h_i$,

$$\left|\varepsilon(\hat h_i) - \hat\varepsilon_{S_{\mathrm{cv}}}(\hat h_i)\right| \le \sqrt{\frac{1}{2\beta m}\log\frac{4k}{\delta}}.$$

Answer: For each $\hat h_i$, the empirical error on the cross-validation set, $\hat\varepsilon_{S_{\mathrm{cv}}}(\hat h_i)$, represents the average of $\beta m$ random variables with mean $\varepsilon(\hat h_i)$, so by the Hoeffding inequality, for any $\hat h_i$,

$$P\left(\left|\varepsilon(\hat h_i) - \hat\varepsilon_{S_{\mathrm{cv}}}(\hat h_i)\right| \ge \gamma\right) \le 2\exp(-2\gamma^2 \beta m).$$

As in the class notes, to ensure that this holds for all $\hat h_i$ simultaneously, we take the union bound over all $k$ of the $\hat h_i$'s:

$$P\left(\exists\, i \text{ s.t. } \left|\varepsilon(\hat h_i) - \hat\varepsilon_{S_{\mathrm{cv}}}(\hat h_i)\right| \ge \gamma\right) \le 2k\exp(-2\gamma^2 \beta m).$$
Setting this term equal to $\delta/2$ and solving for $\gamma$ yields

$$\gamma = \sqrt{\frac{1}{2\beta m}\log\frac{4k}{\delta}},$$

proving the desired bound.

(b) Use part (a) to show that with probability at least $1 - \delta/2$,

$$\varepsilon(\hat h) \le \min_{i=1,\dots,k} \varepsilon(\hat h_i) + \sqrt{\frac{2}{\beta m}\log\frac{4k}{\delta}}.$$

Answer: Let $j = \arg\min_i \varepsilon(\hat h_i)$. Using part (a), with probability at least $1 - \delta/2$,

$$\begin{aligned}
\varepsilon(\hat h) &\le \hat\varepsilon_{S_{\mathrm{cv}}}(\hat h) + \sqrt{\tfrac{1}{2\beta m}\log\tfrac{4k}{\delta}} \\
&= \min_i \hat\varepsilon_{S_{\mathrm{cv}}}(\hat h_i) + \sqrt{\tfrac{1}{2\beta m}\log\tfrac{4k}{\delta}} \\
&\le \hat\varepsilon_{S_{\mathrm{cv}}}(\hat h_j) + \sqrt{\tfrac{1}{2\beta m}\log\tfrac{4k}{\delta}} \\
&\le \varepsilon(\hat h_j) + 2\sqrt{\tfrac{1}{2\beta m}\log\tfrac{4k}{\delta}} \\
&= \min_{i=1,\dots,k} \varepsilon(\hat h_i) + \sqrt{\tfrac{2}{\beta m}\log\tfrac{4k}{\delta}}.
\end{aligned}$$

(c) Let $j = \arg\min_i \left( \varepsilon(h_i^\star) + \sqrt{\frac{2}{(1-\beta)m}\log\frac{4|H_i|}{\delta}} \right)$. We know from class that for $H_j$, with probability at least $1 - \delta/2$,

$$\left|\varepsilon(h) - \hat\varepsilon_{S_{\mathrm{train}}}(h)\right| \le \sqrt{\frac{1}{2(1-\beta)m}\log\frac{4|H_j|}{\delta}}, \quad \forall\, h \in H_j,$$

which in particular implies $\varepsilon(\hat h_j) \le \varepsilon(h_j^\star) + \sqrt{\frac{2}{(1-\beta)m}\log\frac{4|H_j|}{\delta}}$. Use this to prove the final bound given at the beginning of this problem.

Answer: The bounds in parts (a) and (c) both hold simultaneously with probability $(1 - \delta/2)^2 = 1 - \delta + \delta^2/4 > 1 - \delta$, so with probability greater than $1 - \delta$, chaining the result of part (b) with the uniform convergence bound above gives

$$\varepsilon(\hat h) \le \varepsilon(h_j^\star) + \sqrt{\frac{2}{(1-\beta)m}\log\frac{4|H_j|}{\delta}} + \sqrt{\frac{2}{\beta m}\log\frac{4k}{\delta}},$$

which is equivalent to the bound we want to show.
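A brief aside (our addition, not part of the official solution): the two "good" events can also be combined with a simple union bound, which sidesteps any independence consideration. Writing $A$ and $B$ for the events that the bounds in parts (a) and (c) hold,

$$P(A \cap B) \ge 1 - P(A^c) - P(B^c) \ge 1 - \tfrac{\delta}{2} - \tfrac{\delta}{2} = 1 - \delta.$$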
2. VC Dimension

Let the input domain of a learning problem be $X = \mathbb{R}$. Give the VC dimension for each of the following classes of hypotheses. In each case, if you claim that the VC dimension is $d$, then you need to show that the hypothesis class can shatter $d$ points, and explain why there are no $d+1$ points it can shatter.

$h(x) = 1\{a < x\}$, with parameter $a \in \mathbb{R}$.

Answer: VC-dimension = 1.
(a) It can shatter the point $\{0\}$, by choosing $a$ to be $-1$ and $1$.
(b) It cannot shatter any two points $\{x_1, x_2\}$, $x_1 < x_2$, because the labeling $x_1 = 1$, $x_2 = 0$ cannot be realized.

$h(x) = 1\{a < x < b\}$, with parameters $a, b \in \mathbb{R}$.

Answer: VC-dimension = 2.
(a) It can shatter the points $\{0, 2\}$, by choosing $(a,b)$ to be $(3,5)$, $(-1,1)$, $(1,3)$, $(-1,3)$, which realize the four possible labelings.
(b) It cannot shatter any three points $\{x_1, x_2, x_3\}$, $x_1 < x_2 < x_3$, because the labeling $x_1 = x_3 = 1$, $x_2 = 0$ cannot be realized.

$h(x) = 1\{a \sin x > 0\}$, with parameter $a \in \mathbb{R}$.

Answer: VC-dimension = 1. ($a$ controls the amplitude of the sine curve.)
(a) It can shatter the point $\{\pi/2\}$, by choosing $a$ to be $1$ and $-1$.
(b) It cannot shatter any two points $\{x_1, x_2\}$, since the labelings of $x_1$ and $x_2$ flip together as the sign of $a$ changes. If $x_1 = x_2 = 1$ for some $a$, then we cannot achieve $x_1 \neq x_2$; if $x_1 \neq x_2$ for some $a$, then we cannot achieve $x_1 = x_2 = 1$ (the labeling $x_1 = x_2 = 0$ can always be achieved by setting $a = 0$).

$h(x) = 1\{\sin(x + a) > 0\}$, with parameter $a \in \mathbb{R}$.

Answer: VC-dimension = 2. ($a$ controls the phase of the sine curve.)
(a) It can shatter the points $\{\pi/4, 3\pi/4\}$, by choosing $a$ to be $0$, $\pi/2$, $\pi$, and $3\pi/2$.
(b) It cannot shatter any three points $\{x_1, x_2, x_3\}$. Since sine has a period of $2\pi$, define $\tilde{x}_i = x_i \bmod 2\pi$. W.l.o.g., assume $\tilde{x}_1 < \tilde{x}_2 < \tilde{x}_3$. If the labeling $x_1 = x_2 = x_3 = 1$ can be realized, then the labeling $x_1 = x_3 = 1$, $x_2 = 0$ will not be realizable. Notice the similarity to the second class above.
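As a quick numerical illustration (our addition, not part of the original solution), the following MATLAB snippet checks that the interval class $1\{a < x < b\}$ realizes all four labelings of the points $\{0, 2\}$ using the $(a,b)$ pairs given in the answer:

% Sanity check: h(x) = 1{a < x < b} shatters the two points {0, 2}.
pts = [0 2];
ab  = [3 5; -1 1; 1 3; -1 3];    % should realize labelings 00, 10, 01, 11
for i = 1:size(ab,1)
  labels = (ab(i,1) < pts) & (pts < ab(i,2));
  disp(labels);                  % prints [0 0], [1 0], [0 1], [1 1] in turn
end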
3. $\ell_1$ regularization for least squares

In the previous problem set, we looked at the least squares problem where the objective function is augmented with an additional regularization term $\lambda\|\theta\|_2^2$. In this problem we'll consider a similar regularized objective, but this time with a penalty on the $\ell_1$ norm of the parameters, $\lambda\|\theta\|_1$, where $\|\theta\|_1$ is defined as $\sum_i |\theta_i|$. That is, we want to minimize the objective

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(\theta^T x^{(i)} - y^{(i)}\right)^2 + \lambda\sum_{i=1}^{n}|\theta_i|.$$

There has been a great deal of recent interest in $\ell_1$ regularization, which, as we will see, has the benefit of outputting sparse solutions (i.e., many components of the resulting $\theta$ are equal to zero). The $\ell_1$ regularized least squares problem is more difficult than the unregularized or $\ell_2$ regularized cases, because the $\ell_1$ term is not differentiable. However, many efficient algorithms have been developed for this problem that work very well in practice. One very straightforward approach, which we have already seen in class, is the coordinate descent method. In this problem you'll derive and implement a coordinate descent algorithm for $\ell_1$ regularized least squares, and apply it to test data.

(a) Here we'll derive the coordinate descent update for a given $\theta_i$. Given the $X$ and $y$ matrices, as defined in the class notes, as well as a parameter vector $\theta$, how can we adjust $\theta_i$ so as to minimize the optimization objective? To answer this question, we'll rewrite the optimization objective above as

$$J(\theta) = \tfrac{1}{2}\|X\theta - y\|_2^2 + \lambda\|\theta\|_1 = \tfrac{1}{2}\|X_i\theta_i + \bar{X}_i\bar{\theta}_i - y\|_2^2 + \lambda|\theta_i| + \lambda\|\bar{\theta}_i\|_1,$$

where $X_i \in \mathbb{R}^m$ denotes the $i$th column of $X$, and $\bar{\theta}_i$ is equal to $\theta$ except with $\theta_i = 0$; all we have done in rewriting the above expression is to make the $\theta_i$ term explicit in the objective. However, this still contains the $|\theta_i|$ term, which is non-differentiable and therefore difficult to optimize. To get around this we make the observation that the sign of $\theta_i$ must either be non-negative or non-positive. But if we knew the sign of $\theta_i$, then $|\theta_i|$ becomes just a linear term. That is, we can rewrite the objective as

$$J(\theta) = \tfrac{1}{2}\|X_i\theta_i + \bar{X}_i\bar{\theta}_i - y\|_2^2 + \lambda\|\bar{\theta}_i\|_1 + \lambda s_i\theta_i,$$

where $s_i$ denotes the sign of $\theta_i$, $s_i \in \{-1, 1\}$. In order to update $\theta_i$, we can just compute the optimal $\theta_i$ for both possible values of $s_i$ (making sure that we restrict the optimal $\theta_i$ to obey the sign restriction we used to solve for it), then look to see which achieves the best objective value. For each of the possible values of $s_i$, compute the resulting optimal value of $\theta_i$. [Hint: to do this, you can fix $s_i$ in the above equation, then differentiate with respect to $\theta_i$ to find the best value. Finally, clip $\theta_i$ so that it lies in the allowable range; i.e., for $s_i = 1$, you need to clip $\theta_i$ such that $\theta_i \ge 0$.]

Answer: For $s_i = 1$,

$$J(\theta) = \tfrac{1}{2}\operatorname{tr}\!\left[(X_i\theta_i + \bar{X}_i\bar{\theta}_i - y)^T(X_i\theta_i + \bar{X}_i\bar{\theta}_i - y)\right] + \lambda\|\bar{\theta}_i\|_1 + \lambda\theta_i = \tfrac{1}{2}\left(X_i^T X_i\,\theta_i^2 + 2X_i^T(\bar{X}_i\bar{\theta}_i - y)\,\theta_i + \|\bar{X}_i\bar{\theta}_i - y\|_2^2\right) + \lambda\|\bar{\theta}_i\|_1 + \lambda\theta_i,$$

so

$$\frac{\partial J(\theta)}{\partial \theta_i} = X_i^T X_i\,\theta_i + X_i^T(\bar{X}_i\bar{\theta}_i - y) + \lambda,$$

which means the optimal $\theta_i$ is given by

$$\theta_i = \max\left\{\frac{-X_i^T(\bar{X}_i\bar{\theta}_i - y) - \lambda}{X_i^T X_i},\ 0\right\}.$$

Similarly, for $s_i = -1$, the optimal $\theta_i$ is given by

$$\theta_i = \min\left\{\frac{-X_i^T(\bar{X}_i\bar{\theta}_i - y) + \lambda}{X_i^T X_i},\ 0\right\}.$$
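A side remark (our addition, not part of the official solution): the two sign cases combine into the familiar soft-thresholding form

$$\theta_i = \frac{S_\lambda\!\left(-X_i^T(\bar{X}_i\bar{\theta}_i - y)\right)}{X_i^T X_i}, \qquad S_\lambda(u) = \operatorname{sign}(u)\max(|u| - \lambda,\ 0),$$

so the update sets $\theta_i$ exactly to zero whenever the correlation of the $i$th feature with the partial residual has magnitude at most $\lambda$; this is where the sparsity of the solution comes from.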
(b) Implement the above coordinate descent algorithm using the updates you found in the previous part. We have provided a skeleton theta = l1ls(X, y, lambda) function in the q3/ directory. To implement the coordinate descent algorithm, you should repeatedly iterate over all the $\theta_i$'s, adjusting each as you found above. You can terminate the process when $\theta$ changes by less than $10^{-5}$ after all $n$ of the updates.

Answer: The following is our implementation of l1ls.m:

function theta = l1ls(X, y, lambda)

m = size(X,1);
n = size(X,2);
theta = zeros(n,1);
old_theta = ones(n,1);

while (norm(theta - old_theta) > 1e-5)
  old_theta = theta;
  for i = 1:n,
    % compute the two candidate values for theta(i)
    theta(i) = 0;
    theta_i(1) = max((-X(:,i)'*(X*theta - y) - lambda) / (X(:,i)'*X(:,i)), 0);
    theta_i(2) = min((-X(:,i)'*(X*theta - y) + lambda) / (X(:,i)'*X(:,i)), 0);

    % get the objective value for both possible thetas
    theta(i) = theta_i(1);
    obj_theta(1) = 0.5*norm(X*theta - y)^2 + lambda*norm(theta,1);
    theta(i) = theta_i(2);
    obj_theta(2) = 0.5*norm(X*theta - y)^2 + lambda*norm(theta,1);

    % pick the theta which minimizes the objective
    [min_obj, min_ind] = min(obj_theta);
    theta(i) = theta_i(min_ind);
  end
end

(c) Test your implementation on the data provided in the q3/ directory. The [X, y, theta_true] = load_data; function will load all the data; the data was generated by y = X*theta_true + 0.5*randn(m,1), but theta_true is sparse, so that very few of the columns of X actually contain relevant features. Run your l1ls.m implementation on this data set over a range of $\lambda$ values. Comment briefly on how this algorithm might be used for feature selection.

Answer: For $\lambda = 1$, our implementation of $\ell_1$ regularized least squares recovers the exact sparsity pattern of the true parameter that generated the data. In contrast, using any amount of $\ell_2$ regularization still leads to $\theta$'s that contain no zeros. This suggests that $\ell_1$ regularization could be very useful as a feature selection algorithm: we could run $\ell_1$ regularized least squares to see which coefficients are non-zero, then select only these features for use with either least squares or possibly a completely different machine learning algorithm.
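To make the feature-selection use concrete, here is a minimal driver sketch (our addition; the $\lambda$ grid is an assumed example, while load_data and l1ls are the functions named above):

% Sweep lambda and count the surviving (non-zero) coefficients; as lambda
% grows, more entries of theta are driven exactly to zero.
[X, y, theta_true] = load_data;
lambdas = [0.001 0.01 0.1 1 10];     % assumed grid, not from the problem set
for i = 1:length(lambdas)
  theta = l1ls(X, y, lambdas(i));
  fprintf('lambda = %g: %d non-zero coefficients\n', ...
          lambdas(i), sum(abs(theta) > 1e-8));
end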
4. K-Means Clustering

In this problem you'll implement the K-means clustering algorithm on a synthetic data set. There is code and data for this problem in the q4/ directory. Run load X.dat; to load the data file for clustering. Implement the [clusters, centers] = k_means(X, k) function in this directory. As input, this function takes the $m \times n$ data matrix X and the number of clusters k. It should output an $m$-element vector, clusters, which indicates which of the clusters each data point belongs to, and a $k \times n$ matrix, centers, which contains the centroids of each cluster. Run the algorithm on the data provided, with k = 3 and k = 4. Plot the cluster assignments and centroids for each iteration of the algorithm using the draw_clusters(X, clusters, centroids) function. For each k, be sure to run the algorithm several times using different initial centroids.

Answer: The following is our implementation of k_means.m:

function [clusters, centroids] = k_means(X, k)

m = size(X,1);
n = size(X,2);

oldcentroids = zeros(k,n);
centroids = X(ceil(rand(k,1)*m),:);    % k random data points as initial centroids

while (norm(oldcentroids - centroids) > 1e-5)
  oldcentroids = centroids;

  % compute cluster assignments
  for i = 1:m,
    dists = sum((repmat(X(i,:), k, 1) - centroids).^2, 2);
    [min_dist, clusters(i,1)] = min(dists);
  end

  draw_clusters(X, clusters, centroids);
  pause(0.1);

  % compute cluster centroids
  for i = 1:k,
    centroids(i,:) = mean(X(clusters == i, :));
  end
end

Below we show the centroid evolution for two typical runs with k = 3 (the figures are omitted in this transcription). Note that the different starting positions of the clusters lead to different final clusterings.
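As a usage sketch (our addition; the file and function names follow the problem statement, while the distortion comparison is ours):

% Run k-means several times with different random initializations and
% compare the final distortions (sum of squared distances to centroids).
load X.dat;                  % loads the data matrix X
k = 3;                       % also try k = 4
for trial = 1:5
  [clusters, centroids] = k_means(X, k);
  J = sum(sum((X - centroids(clusters,:)).^2));
  fprintf('trial %d: distortion = %g\n', trial, J);
end

Because the objective is non-convex, different restarts can converge to different local optima; keeping the run with the lowest distortion is the usual heuristic.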
5. The Generalized EM algorithm

When attempting to run the EM algorithm, it may sometimes be difficult to perform the M-step exactly; recall that we often need to implement numerical optimization to perform the maximization, which can be costly. Therefore, instead of finding the global maximum of our lower bound on the log-likelihood, an alternative is to just increase this lower bound a little bit, by taking one step of gradient ascent, for example. This is commonly known as the Generalized EM (GEM) algorithm.

Put slightly more formally, recall that the M-step of the standard EM algorithm performs the maximization

$$\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.$$

The GEM algorithm, in contrast, performs the following update in the M-step:

$$\theta := \theta + \alpha\,\nabla_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})},$$

where $\alpha$ is a learning rate which we assume is chosen small enough such that we do not decrease the objective function when taking this gradient step.

(a) Prove that the GEM algorithm described above converges. To do this, you should show that the likelihood is monotonically improving, as it does for the EM algorithm; i.e., show that $\ell(\theta^{(t+1)}) \ge \ell(\theta^{(t)})$.

Answer: We use the same logic as for the standard EM algorithm. Specifically, just as for EM, we have for the GEM algorithm that

$$\begin{aligned}
\ell(\theta^{(t+1)}) &\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log\frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} \\
&\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log\frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} \\
&= \ell(\theta^{(t)}),
\end{aligned}$$

where as in EM the first line holds due to Jensen's inequality, and the last line holds because we choose the $Q$ distribution to make this hold with equality. The only difference between EM and GEM is the logic as to why the second line holds: for EM it held because $\theta^{(t+1)}$ was chosen to maximize this quantity, but for GEM it holds by our assumption that we take a gradient step small enough so as not to decrease the objective function.
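For intuition (our addition): writing $F(\theta)$ for the lower bound, a first-order expansion shows why a small enough step cannot decrease it, assuming $F$ is smooth:

$$F\big(\theta + \alpha\nabla_\theta F(\theta)\big) = F(\theta) + \alpha\|\nabla_\theta F(\theta)\|^2 + O(\alpha^2) \ge F(\theta)$$

for sufficiently small $\alpha > 0$.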
(b) Instead of using the EM algorithm at all, suppose we just want to apply gradient ascent to maximize the log-likelihood directly. In other words, we are trying to maximize the (non-convex) function

$$\ell(\theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta),$$

so we could simply use the update

$$\theta := \theta + \alpha\,\nabla_\theta \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta).$$

Show that this procedure in fact gives the same update as the GEM algorithm described above.

Answer: Differentiating the log-likelihood directly, we get

$$\frac{\partial}{\partial\theta_j}\sum_i \log\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \frac{1}{\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)} \sum_{z^{(i)}} \frac{\partial}{\partial\theta_j} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \sum_{z^{(i)}} \frac{1}{p(x^{(i)}; \theta)} \frac{\partial}{\partial\theta_j} p(x^{(i)}, z^{(i)}; \theta).$$

For the GEM algorithm,

$$\frac{\partial}{\partial\theta_j} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = \sum_i \sum_{z^{(i)}} \frac{Q_i(z^{(i)})}{p(x^{(i)}, z^{(i)}; \theta)} \frac{\partial}{\partial\theta_j} p(x^{(i)}, z^{(i)}; \theta).$$

But the E-step of the GEM algorithm chooses

$$Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)},$$

so

$$\sum_i \sum_{z^{(i)}} \frac{Q_i(z^{(i)})}{p(x^{(i)}, z^{(i)}; \theta)} \frac{\partial}{\partial\theta_j} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \sum_{z^{(i)}} \frac{1}{p(x^{(i)}; \theta)} \frac{\partial}{\partial\theta_j} p(x^{(i)}, z^{(i)}; \theta),$$

which is the same as the derivative of the log-likelihood.
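To make the GEM update concrete, here is a minimal MATLAB sketch (our addition, not part of the problem set): a two-component one-dimensional Gaussian mixture with equal weights and unit variances, where only the means are learned. The responsibilities w play the role of $Q_i$, and the M-step is a single gradient step rather than an exact maximization. All names and constants here are illustrative assumptions.

% GEM for a 1-D mixture of two unit-variance Gaussians with weights 0.5;
% mu = [mu(1); mu(2)] are the unknown component means.
X = [randn(50,1) - 2; randn(50,1) + 2];   % synthetic data
mu = [-0.5; 0.5];                         % initial means
alpha = 0.05;                             % learning rate, assumed small

for iter = 1:200
  % E-step: responsibilities Q_i(z) = p(z | x_i; mu)
  % (the 1/sqrt(2*pi) normalizing constants cancel in the ratio)
  p1 = 0.5 * exp(-(X - mu(1)).^2 / 2);
  p2 = 0.5 * exp(-(X - mu(2)).^2 / 2);
  w = p1 ./ (p1 + p2);                    % posterior probability of z = 1

  % Generalized M-step: one gradient step on the lower bound;
  % the gradient w.r.t. mu_j is sum_i w_ij * (x_i - mu_j)
  grad = [sum(w .* (X - mu(1))); sum((1 - w) .* (X - mu(2)))];
  mu = mu + alpha * grad;
end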