CSE 546: Machine Learning
Lecture 6: Feature Selection, Part 2
Instructor: Sham Kakade

1 Greedy Algorithms (continued from the last lecture)

There are a variety of greedy algorithms and numerous naming conventions for these algorithms. These algorithms must rely on some stopping condition (or some condition to limit the sparsity level of the solution).

1.1 Stagewise Regression / Matching Pursuit / Boosting

Here, we typically do not regularize our objective function and, instead, directly deal with the empirical loss $\hat L(w_1, w_2, \ldots, w_d)$. This class of algorithms for minimizing an objective function $\hat L(w_1, w_2, \ldots, w_d)$ is as follows:

1. Initialize: $w = 0$.

2. Choose the coordinate which can result in the greatest decrease in error, i.e.
$$ i = \arg\min_i \, \min_{z \in \mathbb{R}} \hat L(w_1, \ldots, w_{i-1}, z, w_{i+1}, \ldots, w_d). $$

3. Update $w$ as follows:
$$ w_i \leftarrow \arg\min_{z \in \mathbb{R}} \hat L(w_1, \ldots, w_{i-1}, z, w_{i+1}, \ldots, w_d), $$
where the optimization is over the $i$-th coordinate (holding the other coordinates fixed).

4. While some termination condition is not met, return to step 2. This termination condition can be looking at the error on some holdout set or simply just running the algorithm for some predetermined number of steps. (A code sketch of this procedure for the square loss appears at the end of this subsection.)

Variants: Clearly, many variants are possible. Sometimes (for loss functions other than the square loss) it is costly to do the minimization exactly, so we sometimes choose $i$ based on another method (e.g. the magnitude of the gradient of a coordinate). We could also re-optimize the weights of all those features which are currently added. Also, sometimes we do backward steps, where we try to prune away some of the features which were added.

Relation to boosting: In boosting, we sometimes do not explicitly enumerate the set of all features. Instead, we have a weak learner which provides us with a new feature. The importance of this viewpoint is that sometimes it is difficult to enumerate the set of all features (e.g. our features could be decision trees, so our feature vector $x$ could be of dimension the number of possible trees). Instead, we just assume some oracle in step 2 which provides us with a feature. There are numerous variants.
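As a concrete illustration of steps 1–4 above, here is a minimal sketch of stagewise regression for the square loss, assuming NumPy. It is not part of the original notes: the function name `stagewise_regression` and the fixed iteration count standing in for the termination condition are choices made purely for this example.

```python
import numpy as np

def stagewise_regression(X, Y, n_steps=50):
    """Greedy coordinate minimization of the square loss ||Y - Xw||^2.

    Step 1: start at w = 0.  Steps 2-3: pick the coordinate whose exact
    one-dimensional minimization most decreases the loss and update only it.
    Step 4: here the termination rule is simply a fixed number of steps.
    """
    n, d = X.shape
    w = np.zeros(d)
    col_norms = np.sum(X ** 2, axis=0)           # ||X_j||^2 for each column j
    residual = Y - X @ w

    for _ in range(n_steps):
        # For the square loss, re-optimizing coordinate j alone decreases the
        # loss by <X_j, residual>^2 / ||X_j||^2.
        correlations = X.T @ residual
        decreases = correlations ** 2 / col_norms
        j = int(np.argmax(decreases))            # step 2: best coordinate
        w[j] += correlations[j] / col_norms[j]   # step 3: exact 1-d update
        residual = Y - X @ w
    return w
```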
1.2 Stepwise Regression / Orthogonal Matching Pursuit

Note that the previous algorithm finds $i$ by only checking the improvement in performance while keeping all the other variables fixed. At any given iteration, we have some subset $S$ of features whose weights are not 0. Instead, when determining which coordinate $i$ to add, we could look at the improvement based on re-optimizing the weights on the full set $S \cup \{i\}$. This is a more costly procedure computationally, though there are some ways to reduce the computational cost.
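Here is a minimal sketch of this stepwise / orthogonal matching pursuit idea, in the same NumPy style as the previous snippet. It implements the expensive version literally: every candidate coordinate is tried by re-fitting least squares on the enlarged support. The function name and the stopping rule (a fixed number of features) are choices made for this illustration, not something fixed by the notes.

```python
import numpy as np

def stepwise_regression(X, Y, n_features):
    """At each step, add the coordinate giving the lowest squared error
    after re-optimizing all weights on the enlarged support S ∪ {j}."""
    n, d = X.shape
    support = []
    best_w = np.zeros(0)

    for _ in range(n_features):
        best_j, best_loss = None, np.inf
        for j in range(d):
            if j in support:
                continue
            cols = support + [j]
            # jointly re-fit the weights on S ∪ {j}
            w_S, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
            loss = np.sum((Y - X[:, cols] @ w_S) ** 2)
            if loss < best_loss:
                best_j, best_loss, best_w = j, loss, w_S
        support.append(best_j)

    w = np.zeros(d)
    w[support] = best_w
    return w, support
```

One common way to reduce the cost is to select $j$ by the magnitude of its correlation with the current residual and then re-fit the weights only once per step.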
2 Feature Selection in the Orthogonal Case

Let us suppose there are $s$ relevant features out of the $d$ possible features. Throughout this analysis, let us assume that:
$$ Y = X w^* + \eta, $$
where $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times d}$, and $\eta \in \mathbb{R}^n$ is the Gaussian noise vector (with each coordinate sampled from $N(0, \sigma^2)$). We assume that the support of $w^*$ (the number of non-zero entries) is $s$.

Let us suppose that our design matrix is orthogonal. In particular, suppose that:
$$ \Sigma = \frac{1}{n} X^\top X = \text{diagonal}. $$

Now let us consider the least squares estimate (ignoring feature selection issues). Under the diagonal assumption, without loss of generality, we can assume that $\Sigma = I$ (by rescaling each coordinate). Here we have that the $j$-th coordinate of the (global) least squares estimate, $[\hat w]_j$, is just the correlation between the $j$-th dimension and $Y$:
$$ [\hat w_{\text{least squares}}]_j = \frac{1}{n} [X^\top Y]_j = \frac{1}{n} \sum_{i=1}^n X_{i,j} Y_i. $$

2.1 A high probability regret bound

Suppose we knew the support size $s$. Let us consider the estimator which minimizes the empirical loss and has support on only $s$ coordinates. In particular, consider the estimator:
$$ \hat w_{\text{subset selection}} = \arg\min_{\,\text{support}(w) \le s} \hat L(w), $$
where the inf is over vectors with support size at most $s$.

In the orthogonal case, computing this estimator is actually rather easy. Provided we have scaled the coordinates so that $\Sigma$ is the identity, our estimate simply consists of the $s$ largest coordinates (in magnitude) of $\hat w_{\text{least squares}}$. In other words, a simple forward greedy algorithm suffices to compute $\hat w_{\text{subset selection}}$. (A code sketch of this, together with the Lasso estimator of Section 2.2, is given after the statement of Theorem 2.2 below.)

Now let us explicitly provide the following high probability bound on the parameter error:

Theorem 2.1. (high probability bound) We have that, with probability greater than $1 - \delta$:
$$ \|\hat w_{\text{subset selection}} - w^*\|^2 \le \frac{10\, s\, \sigma^2 \log(2d/\delta)}{n}. $$

Proof. For any particular coordinate $j$, the Gaussian tail bound implies that, with probability greater than $1 - \delta$,
$$ [\hat w_{\text{least squares}}]_j \le [w^*]_j + \sqrt{\frac{2 \sigma^2 \log(1/\delta)}{n}} \qquad (1) $$
and also that
$$ [\hat w_{\text{least squares}}]_j \ge [w^*]_j - \sqrt{\frac{2 \sigma^2 \log(1/\delta)}{n}} $$
(using the Gaussian tail bound, which is proved in the optional reading). We would like these inequalities to simultaneously hold for all coordinates $j$.
The union bound states that for events $E_1$ to $E_k$:
$$ \Pr(E_1 \text{ or } E_2 \ldots \text{ or } E_k) \le \sum_j \Pr(E_j). $$
Now consider the following $2d$ events: one of the above 2 inequalities fails for some coordinate $j$. Note that if we use $\delta/2d$ in the above, then the cumulative failure probability is less than:
$$ \Pr(\text{any failure}) \le \sum_j \Pr(\text{one-sided failure for } j) \le 2d \, (\delta / 2d) = \delta. $$
Hence, we have that, with probability greater than $1 - \delta$:
$$ \max_j \big| [\hat w_{\text{least squares}}]_j - [w^*]_j \big| \le \sqrt{\frac{2 \sigma^2 \log(2d/\delta)}{n}} =: \epsilon. $$
Here, we have that $\delta$ is a bound on the cumulative failure probability. Also, we have defined $\epsilon$ in the last inequality.

Let $S$ be the optimal support set (i.e. the support set of $w^*$). For any $w$ we have:
$$ \|w - w^*\|^2 = \sum_{j \notin S} [w]_j^2 + \sum_{j \in S} ([w]_j - [w^*]_j)^2. $$
Now, for those features $j \notin S$, we have $|[\hat w_{\text{least squares}}]_j| \le \epsilon$. Hence, for those features $j \in S$, we have that:
$$ \big| [\hat w_{\text{subset selection}}]_j - [w^*]_j \big| \le 2\epsilon. $$
To see this, note that if $|[w^*]_j| > 2\epsilon$, then we will include this feature, since its least squares coordinate exceeds $\epsilon$ in magnitude while every coordinate outside $S$ is at most $\epsilon$ (and in this case our estimation error on it is at most $\epsilon$). If $|[w^*]_j| \le 2\epsilon$, then we might mistakenly exclude this feature (in which case the above is also true). Hence,
$$ \sum_{j \in S} ([\hat w_{\text{subset selection}}]_j - [w^*]_j)^2 \le 4 s \epsilon^2. $$
Also, as we only include at most $s$ features outside of $S$ (each of magnitude at most $\epsilon$), we have:
$$ \sum_{j \notin S} ([\hat w_{\text{subset selection}}]_j)^2 \le s \epsilon^2. $$
Adding these together (and using the value of $\epsilon$) completes the proof.

2.2 The Lasso in the orthogonal case

Let us consider the case where $\Sigma = I$. Note that if $\Sigma$ is simply diagonal, then one must rescale the coordinates for this algorithm to work. (Also, note that if we had used the greedy algorithm, we would not need to explicitly do this rescaling.)

We can now argue that using $\lambda \approx \sigma \sqrt{\frac{\log d}{n}}$ suffices to give the Lasso a high probability of success.

Theorem 2.2. (Lasso in the orthogonal case) Suppose $\Sigma = I$. Set $\lambda = 10 \sigma \sqrt{\frac{\log(d/\delta)}{n}}$. Let
$$ \hat w_{\text{lasso}} \in \arg\min_w \frac{1}{n} \|Xw - Y\|^2 + \lambda \|w\|_1. $$
Then we have that, with probability greater than $1 - \delta$,
$$ \|\hat w_{\text{lasso}} - w^*\|^2 \le \frac{c\, s\, \sigma^2 \log(d/\delta)}{n}, $$
where $c$ is a universal constant.
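Before turning to the proof, here is a minimal sketch making both orthogonal-case estimators concrete, assuming NumPy and a design already rescaled so that $\Sigma = \frac{1}{n} X^\top X = I$. Subset selection keeps the $s$ largest coordinates of $\hat w_{\text{least squares}}$ in magnitude, while the Lasso soft-thresholds each coordinate (the threshold $\lambda/2$ below matches the objective exactly as written above; a differently normalized loss would shift this constant). The function names and the small synthetic check are mine, not part of the notes.

```python
import numpy as np

def least_squares_orthogonal(X, Y):
    """[w_ls]_j = (1/n) * sum_i X_{i,j} Y_i, valid when (1/n) X^T X = I."""
    return (X.T @ Y) / X.shape[0]

def subset_selection_orthogonal(X, Y, s):
    """Keep the s largest-magnitude coordinates of the least squares estimate."""
    w_ls = least_squares_orthogonal(X, Y)
    w_hat = np.zeros_like(w_ls)
    keep = np.argsort(-np.abs(w_ls))[:s]
    w_hat[keep] = w_ls[keep]
    return w_hat

def lasso_orthogonal(X, Y, lam):
    """Minimize (1/n)||Xw - Y||^2 + lam*||w||_1 when (1/n) X^T X = I.
    The problem separates across coordinates, giving soft thresholding at lam/2."""
    w_ls = least_squares_orthogonal(X, Y)
    return np.sign(w_ls) * np.maximum(np.abs(w_ls) - lam / 2.0, 0.0)

# Tiny synthetic check of the setting Y = X w* + noise with an orthogonal design.
rng = np.random.default_rng(0)
n, d, s, sigma = 400, 50, 5, 1.0
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = np.sqrt(n) * Q                          # now (1/n) X^T X = I_d
w_star = np.zeros(d)
w_star[:s] = 1.0
Y = X @ w_star + sigma * rng.standard_normal(n)
lam = 10 * sigma * np.sqrt(np.log(d) / n)   # the order of lambda used in Theorem 2.2
print("subset selection error:", np.sum((subset_selection_orthogonal(X, Y, s) - w_star) ** 2))
print("lasso error:           ", np.sum((lasso_orthogonal(X, Y, lam) - w_star) ** 2))
```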
Proof. Note that:
$$ \frac{1}{n} \|Xw - Y\|^2 + \lambda \|w\|_1 = \|w - w^*\|^2 - \frac{2}{n} \eta^\top X (w - w^*) + \lambda \|w\|_1 + \text{constant}, $$
using that $\Sigma = I$ and the definition of $Y$. Hence, the Lasso is minimizing:
$$ \|w - w^*\|^2 - 2 \tilde\eta^\top (w - w^*) + \lambda \|w\|_1, $$
where we have defined $\tilde\eta = \frac{1}{n} X^\top \eta$. By our assumption on $X$, each coordinate of $\tilde\eta$ is distributed as $N(0, \sigma^2/n)$. Hence, the Lasso simplifies to solving $d$ separate 1-dimensional problems, one per coordinate. Also, with probability greater than $1 - \delta$, each coordinate of $\tilde\eta$ is bounded in magnitude by $\sigma \sqrt{\frac{2 \log(d/\delta)}{n}}$. Choosing $\lambda$ at least this large ensures that all irrelevant features will be thresholded to 0, and all relevant features will have their weights shrunk by at most $\lambda$, which results in a regret of the same order as that of the subset selection algorithm.

3 When do the Lasso and the greedy algorithm also have low regret?

There has been much work showing that the Lasso and the greedy algorithms can obtain regret bounds comparable to the subset selection algorithm, under additional assumptions. These assumptions can be viewed as relaxations of the orthogonality condition. One weakening is based on incoherence (which is essentially an assumption that the feature matrix $X$ has properties similar to those of a random matrix). Namely, this assumption is that for all coordinates $j \ne k$, the empirical correlation
$$ \frac{1}{n} \sum_{i=1}^n X_{i,j} X_{i,k} $$
is small in magnitude, where we also assume that the columns are normalized so that:
$$ \frac{1}{n} \sum_{i=1}^n X_{i,j}^2 = 1. $$
The Restricted Isometry Property (RIP) is a weaker assumption than this. Under either of these assumptions, both the Lasso and the greedy algorithms can have risk bounds comparable to that of the subset selection algorithm.
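The incoherence condition above is easy to check numerically: rescale the columns so that $\frac{1}{n}\sum_i X_{i,j}^2 = 1$ and look at the largest off-diagonal entry of the empirical correlation matrix. A small NumPy sketch follows; the function name is mine, and how small this quantity must be for a given guarantee depends on the specific result being invoked, which the notes do not pin down here.

```python
import numpy as np

def max_coherence(X):
    """Largest |(1/n) sum_i X_{i,j} X_{i,k}| over pairs j != k, after rescaling
    each column so that (1/n) sum_i X_{i,j}^2 = 1."""
    n, d = X.shape
    Xn = X / np.sqrt(np.mean(X ** 2, axis=0))    # enforce the column normalization
    C = (Xn.T @ Xn) / n                          # empirical correlation matrix
    return np.max(np.abs(C - np.eye(d)))         # diagonal of C is exactly 1
```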