CSE 546: Machine Learning        Lecture 5: Feature Selection: Part 1        Instructor: Sham Kakade

1 Regression in the high dimensional setting

How do we learn when the number of features d is greater than the sample size n? In the previous lecture, we examined ridge regression and provided a dimension-free rate of convergence. Now let us examine feature selection.

2 Feature Selection

Let us suppose there are s relevant features out of the d possible features. Throughout this analysis, let us assume that:

    Y = Xw + \eta,

where Y \in \mathbb{R}^n and X \in \mathbb{R}^{n \times d}. We assume that the support of w (the number of non-zero entries) is s.

2.1 Loss Minimization (Empirical Risk Minimization)

Define our empirical loss as:

    \hat{L}(w) = \frac{1}{n} \| Xw - Y \|^2,

which has no expectation over Y.

Suppose we knew the support size s. One algorithm is to simply find the estimator which minimizes the empirical loss and has support on only s coordinates. In particular, consider the estimator:

    \hat{w}_{\mathrm{subset\ selection}} = \arg\min_{|\mathrm{support}(w)| \le s} \hat{L}(w),

where the minimization is over vectors with support size at most s. Computing this estimator is not computationally tractable in general (the naive algorithm runs in time roughly d^s). Furthermore, finding the best subset is known to be an NP-hard problem.

How much better is this estimator than the naive estimator? Recall the risk is:

    R(\hat{w}_{\mathrm{subset\ selection}}) = \mathbb{E}_Y \| \hat{w} - w \|_\Sigma^2,

where the expectation is over Y. We have the following theorem:

Theorem 2.1. Suppose the support size of w is bounded by s. We have that the risk is bounded as:

    R(\hat{w}_{\mathrm{subset\ selection}}) \le c \, \frac{s \log d}{n} \, \sigma^2

(where c is a universal constant).
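To make the estimator above concrete, here is a minimal brute-force sketch (in Python with numpy; the function name and the exhaustive-enumeration strategy are illustrative choices, not part of the lecture). It enumerates all supports of size s, solves least squares on each, and keeps the one with the smallest empirical loss, which makes the roughly d^s running time explicit.

```python
import itertools
import numpy as np

def best_subset_selection(X, Y, s):
    """Brute-force subset selection: minimize the empirical loss
    (1/n) * ||X w - Y||^2 over all w with support size at most s.
    Runs in time roughly (d choose s), so it is only feasible for tiny d."""
    n, d = X.shape
    best_loss, best_w = np.inf, np.zeros(d)
    for subset in itertools.combinations(range(d), s):
        cols = list(subset)
        # Least squares restricted to the chosen coordinates.
        w_sub, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
        w = np.zeros(d)
        w[cols] = w_sub
        loss = np.mean((X @ w - Y) ** 2)
        if loss < best_loss:
            best_loss, best_w = loss, w
    return best_w
```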
2.2 Coordinate dependence?

Clearly, the coordinate system is important here, as the support is defined with respect to this coordinate system. However, note that the scale in each coordinate is irrelevant here. In contrast, note that empirical risk minimization does not depend on the coordinate system.

3 Norms

The \ell_p norm of a vector x is:

    \| x \|_p = \left( \sum_i |x_i|^p \right)^{1/p}.

The \ell_0 norm is defined as:

    \| x \|_0 = |\{ i : x_i \neq 0 \}|,

which is the number of non-zero entries in x. Technically, the \ell_0 norm is not a norm.

4 Lasso

Let us view the subset selection problem as a regularized problem. A relaxed version of a hard constraint on the size of the subset would be to minimize:

    \hat{L}(w) + \lambda \| w \|_0.

One can show that for an appropriate choice of \lambda this algorithm also enjoys the same risk guarantee as the hard-constrained subset selection algorithm (up to constants). Unfortunately, minimizing this objective function is also not computationally tractable.

A natural convex relaxation is to instead consider minimizing the following:

    F(w) = \hat{L}(w) + \lambda \| w \|_1,

which can be viewed as a convex relaxation of the \ell_0 problem. This is referred to as the Lasso.

4.1 Coordinate Scalings

Often it is a good idea to transform the data so that the variance along each coordinate is 1. In other words, for each coordinate j, it often makes sense to do the following transformation:

    X_{i,j} \leftarrow X_{i,j} / Z_j, where Z_j = \sqrt{ \frac{1}{n} \sum_i X_{i,j}^2 }.

Intuitively, this is to remove an arbitrary scale factor. A more precise reason for this will be discussed in the next lecture.
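As a small illustration of the rescaling above and of the Lasso objective F(w), here is a minimal sketch in Python/numpy (the function names are mine, and it assumes Z_j is taken to be the root-mean-square of column j, as in the formula above):

```python
import numpy as np

def rescale_columns(X):
    """Rescale each column j by Z_j = sqrt((1/n) * sum_i X[i, j]^2),
    so every coordinate is on the same (unit) scale."""
    Z = np.sqrt(np.mean(X ** 2, axis=0))  # one scale factor per column
    return X / Z

def lasso_objective(X, Y, w, lam):
    """F(w) = (1/n) * ||X w - Y||^2 + lam * ||w||_1."""
    return np.mean((X @ w - Y) ** 2) + lam * np.sum(np.abs(w))
```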
4.2 Optimization & Coordinate Descent

The 1-dimensional case: Suppose that we are in the 1-dimensional case, where each x_i is a scalar and so w is a scalar. The lasso problem is then to minimize:

    \sum_i (y_i - w x_i)^2 + \lambda |w|,

where |w| is the absolute value function. To minimize this function, we can again set the gradient to 0 and solve. A subtlety here is that the absolute value function is non-differentiable at 0. Note that for any w \neq 0 the gradient is:

    -2 \sum_i x_i (y_i - w x_i) + \lambda \, \mathrm{sgn}(w),    (1)

where sgn(w) is 1 if w is positive and -1 if w is negative.

There are three cases to check. If the minimizer w is positive, then we know that the first order condition implies that:

    w = \frac{ \sum_i y_i x_i - \lambda/2 }{ \sum_i x_i^2 }.

If we compute the right hand side and it is positive, then indeed this value is the minimizer. Now suppose the minimizer w is negative; then we know that the first order condition implies that:

    w = \frac{ \sum_i y_i x_i + \lambda/2 }{ \sum_i x_i^2 }.

If we compute the right hand side and it is negative, then indeed this value is the minimizer.

Now suppose w is 0. Note that |w| is not differentiable at 0. Here, one can show that we must have that:

    2 \sum_i y_i x_i \in [-\lambda, \lambda].

So if we compute the left hand side and it is in the interval [-\lambda, \lambda], then w = 0 is a minimizer. To see this, consider any small perturbation so that w = \epsilon. Suppose \epsilon > 0. For sufficiently small \epsilon, the first term in (1) will still be in the interval [-\lambda, \lambda], and so the gradient will be strictly positive (for small \epsilon). Thus gradient descent will push us back to 0. Similarly, for \epsilon < 0, we will move back to 0. Formally, the sub-gradient of |w| at 0 can take any value in [-1, 1], which gives a valid tangent plane (see the Wikipedia definition).

Coordinate Descent: The coordinate descent algorithm for minimizing an objective function F(w_1, w_2, ..., w_d) is as follows:

1. Initialize: w = 0.
2. Choose a coordinate i (e.g. at random).
3. Update w_i as follows:

    w_i \leftarrow \arg\min_{z \in \mathbb{R}} F(w_1, ..., w_{i-1}, z, w_{i+1}, ..., w_d),

where the optimization is over the i-th coordinate (holding the other coordinates fixed). Then return to step 2.

Clearly, many natural variants are possible.
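Putting the pieces together, here is a minimal coordinate descent sketch for the lasso (Python/numpy; the function name, the fixed number of sweeps num_iters, and cycling through coordinates in order rather than choosing them at random are illustrative choices). Each coordinate update applies the exact three-case 1-dimensional solution derived above to the residual that excludes that coordinate; note it minimizes the un-normalized sum-of-squares objective from the 1-dimensional derivation, which only amounts to a rescaling of \lambda relative to \hat{L}.

```python
import numpy as np

def lasso_coordinate_descent(X, Y, lam, num_iters=100):
    """Coordinate descent for  sum_i (y_i - x_i . w)^2 + lam * ||w||_1
    (the un-normalized objective, matching the 1-d derivation above).
    Each coordinate update is the exact 1-d minimizer (three cases).
    Assumes no column of X is identically zero."""
    n, d = X.shape
    w = np.zeros(d)                      # step 1: initialize w = 0
    col_sq = np.sum(X ** 2, axis=0)      # a_j = sum_i X[i, j]^2
    for _ in range(num_iters):
        for j in range(d):               # sweep the coordinates in order
            # residual with the j-th coordinate's contribution removed
            r = Y - X @ w + X[:, j] * w[j]
            c = X[:, j] @ r              # c_j = sum_i r_i X[i, j]
            # exact 1-d solution: the three cases from the text
            if c > lam / 2:
                w[j] = (c - lam / 2) / col_sq[j]
            elif c < -lam / 2:
                w[j] = (c + lam / 2) / col_sq[j]
            else:
                w[j] = 0.0
    return w
```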
5 Relationship of Lasso to Compressed Sensing

As we discussed earlier, regression can be viewed as finding an approximate solution to an (inconsistent) linear system of equations. In compressed sensing, we are dealing with the setting where the system of equations is consistent, i.e. Aw = b has a solution. Suppose we are in the case where A is of size n x d and d > n, so there are multiple solutions. In particular, we seek the sparsest solution:

    \min_w \| w \|_0  s.t.  Aw = b.

As before, this problem is not computationally tractable. The convex relaxation is the following optimization problem:

    \min_w \| w \|_1  s.t.  Aw = b.

Similar to the case of the lasso, under certain assumptions this can recover the solution to the \ell_0 problem.

6 Greedy Algorithms

There are a variety of greedy algorithms and numerous naming conventions for these algorithms. These algorithms must rely on some stopping condition (or some condition to limit the sparsity level of the solution).

6.1 Stagewise Regression / Matching Pursuit / Boosting

Here, we typically do not regularize our objective function and, instead, directly deal with the empirical loss \hat{L}(w_1, w_2, ..., w_d). This class of algorithms for minimizing an objective function \hat{L}(w_1, w_2, ..., w_d) is as follows (a code sketch is given at the end of this subsection):

1. Initialize: w = 0.
2. Choose the coordinate which can result in the greatest decrease in error, i.e.

    i \leftarrow \arg\min_i \min_{z \in \mathbb{R}} \hat{L}(w_1, ..., w_{i-1}, z, w_{i+1}, ..., w_d).

3. Update w_i as follows:

    w_i \leftarrow \arg\min_{z \in \mathbb{R}} \hat{L}(w_1, ..., w_{i-1}, z, w_{i+1}, ..., w_d),

where the optimization is over the i-th coordinate (holding the other coordinates fixed).

4. While some termination condition is not met, return to step 2. This termination condition can be looking at the error on some holdout set or simply running the algorithm for some predetermined number of steps.

Variants: Clearly, many variants are possible. Sometimes (for loss functions other than the square loss) it is costly to do the minimization exactly, so we sometimes choose the coordinate based on another method (e.g. the magnitude of the gradient of a coordinate). We could also re-optimize the weights of all those features which are currently added. Also, sometimes we do backward steps, where we try to prune away some of the features which were added.

Relation to boosting: In boosting, we sometimes do not explicitly enumerate the set of all features. Instead, we have a weak learner which provides us with a new feature. The importance of this viewpoint is that sometimes it is difficult to enumerate the set of all features (e.g. our features could be decision trees, so our feature vector x could be of dimension equal to the number of possible trees). Instead, we just assume some oracle which, in step 2, provides us with a feature. There are numerous variants.
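As a small illustration of the stagewise procedure above for the square loss, here is a minimal sketch in Python/numpy (the function name, the fixed num_steps termination rule, and the closed-form 1-dimensional updates are my choices; the boosting/weak-learner variant with an oracle is not shown). For the square loss, the optimal single-coordinate update and the resulting decrease in error both have closed forms in terms of the residual.

```python
import numpy as np

def stagewise_regression(X, Y, num_steps=20):
    """Greedy stagewise regression / matching pursuit for the square loss.
    At each step, pick the single coordinate whose optimal 1-d update
    (holding all other coordinates fixed) most decreases the error,
    then update only that coordinate. Terminates after num_steps steps.
    Assumes no column of X is identically zero."""
    n, d = X.shape
    w = np.zeros(d)                       # step 1: initialize w = 0
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(num_steps):
        r = Y - X @ w                     # current residual
        corr = X.T @ r                    # <x_j, r> for every coordinate j
        # decrease in squared error from the optimal 1-d update on coordinate j
        gains = corr ** 2 / col_sq
        i = int(np.argmax(gains))         # step 2: coordinate with largest decrease
        w[i] += corr[i] / col_sq[i]       # step 3: exact 1-d re-optimization of w_i
    return w
```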
6.2 Stepwise Regression / Orthogonal Matching Pursuit

Note that the previous algorithm chooses i by only checking the improvement in performance while keeping all the other variables fixed. At any given iteration, we have some subset S of features whose weights are not 0. Instead, when determining which coordinate i to add, we could look at the improvement based on re-optimizing the weights on the full set S \cup \{i\}. This is a more costly procedure computationally, though there are some ways to reduce the computational cost.
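Here is a minimal sketch of this stepwise variant (Python/numpy; the function name, the max_features stopping rule, and the fully exhaustive scoring of every candidate coordinate by re-fitting least squares are illustrative choices, without the computational shortcuts mentioned above). At every iteration, each candidate i is scored by the empirical loss after re-optimizing all weights on S \cup \{i\}.

```python
import numpy as np

def stepwise_regression(X, Y, max_features=10):
    """Stepwise regression / orthogonal matching pursuit (exhaustive variant).
    At each iteration, score every candidate coordinate i by fully
    re-optimizing the least-squares weights on S + {i}, then add the best one.
    Stops after max_features features (assumes max_features <= d)."""
    n, d = X.shape
    S = []                                 # current support
    w = np.zeros(d)
    for _ in range(max_features):
        best_i, best_loss, best_w_sub = None, np.inf, None
        for i in range(d):
            if i in S:
                continue
            cols = S + [i]
            # re-optimize all weights on the candidate support S + {i}
            w_sub, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
            loss = np.mean((X[:, cols] @ w_sub - Y) ** 2)
            if loss < best_loss:
                best_i, best_loss, best_w_sub = i, loss, w_sub
        S.append(best_i)
        w = np.zeros(d)
        w[S] = best_w_sub                  # weights re-fit on the enlarged support
    return w
```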