Dimensionality reduction Feature selection


CS 2750 Machine Learning
Lecture 3: Dimensionality reduction. Feature selection.
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Dimensionality reduction. Motivation.
Classification problem example: we have input data {x_1, x_2, ..., x_N} such that x_i = (x_i^1, x_i^2, ..., x_i^d), and a set of corresponding output labels {y_1, y_2, ..., y_N}. Assume the dimension d of the data points is very large, and we want to classify x.
Problems with high-dimensional input vectors:
- A large number of parameters to learn; if the dataset is small, this can result in large variance of the estimates and overfitting.
- It becomes hard to explain which features are important in the model (too many choices, some of them substitutable).

Dimensionality reduction. Solutions:
- Selection of a smaller subset of inputs (features) from a large set of inputs; train the classifier on the reduced input set.
- Combination of the high-dimensional inputs into a smaller set of features φ_k(x); train the classifier on the new features.

Feature selection
How do we find a good subset of inputs/features? We need:
- a criterion for ranking good inputs/features,
- a search procedure for finding a good set of features.
The feature selection process can be:
- Dependent on the learning task (e.g. classification): the selection of features is affected by what we want to predict.
- Independent of the learning task: the inputs are reduced without looking at the output (PCA, component analysis, clustering of inputs); these may lack the accuracy needed for classification/regression tasks.

Task-dependent feature selection
Assume a classification problem: x is the input vector, y the output, and the feature mappings are φ = {φ_1(x), φ_2(x), ..., φ_k(x), ...}.
Objective: find a subset of features that gives/preserves most of the output prediction capabilities.
Selection approaches:
- Filtering approaches: filter out features with small predictive potential; done before classification, typically using univariate analysis.
- Wrapper approaches: select features that directly optimize the accuracy of the multivariate classifier.
- Embedded methods: feature selection and learning are closely tied in the method.

Feature selection through filtering
Assume a classification problem: x is the input vector, y the output, with inputs or feature mappings φ_k(x).
How to select the features: univariate analysis. Pretend that only one variable, φ_k, exists and see how well it predicts the output y alone.
Example: differentially expressed features (or inputs), i.e. features giving a good separation in binary (case/control) settings.

Differentially expressed features
Criteria for measuring the differential expression:
- t-test score (Baldi & Long): based on testing whether the two groups come from the same population.
- Fisher score: $\mathrm{Fisher}(i) = \dfrac{\left(\mu_i^{(+)} - \mu_i^{(-)}\right)^2}{\left(\sigma_i^{(+)}\right)^2 + \left(\sigma_i^{(-)}\right)^2}$
- Area under the Receiver Operating Characteristic curve (AUC) score.
Problem: if there are many random features, some features with a good differential-expression score will arise purely by chance. Techniques to reduce the FDR (false discovery rate) and FWER (family-wise error rate) address this.

Feature filtering
Other univariate scores:
- Correlation coefficient $\rho(\phi_k, y) = \dfrac{\mathrm{Cov}(\phi_k, y)}{\sqrt{\mathrm{Var}(\phi_k)\,\mathrm{Var}(y)}}$, which measures linear dependences.
- Mutual information $I(\tilde{\phi}_k, y) = \sum_{i,j} P(\tilde{\phi}_k = j, y = i) \log \dfrac{P(\tilde{\phi}_k = j, y = i)}{P(\tilde{\phi}_k = j)\,P(y = i)}$.
Univariate assumptions: only one feature and its effect on y is incorporated in the score; the effects of two features on y are treated as independent. What do we do if a combination of features gives the best prediction?
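To make the univariate filter scores above concrete, here is a minimal sketch (assuming NumPy and scikit-learn; the synthetic dataset and helper names are illustrative, not from the lecture) that ranks the features of a binary problem by the Fisher score and by an estimate of the mutual information:

```python
# Univariate filter scores on a synthetic binary-labeled data matrix X (N x d)
# with labels y in {0, 1}. Names and data are illustrative.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def fisher_scores(X, y):
    """Fisher score per feature: (mu+ - mu-)^2 / (sigma+^2 + sigma-^2)."""
    Xp, Xn = X[y == 1], X[y == 0]
    num = (Xp.mean(axis=0) - Xn.mean(axis=0)) ** 2
    den = Xp.var(axis=0) + Xn.var(axis=0)
    return num / (den + 1e-12)          # guard against zero variance

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)  # only feature 0 is informative

fisher = fisher_scores(X, y)
mi = mutual_info_classif(X, y, random_state=0)   # estimates I(phi_k, y)
print("top features by Fisher score:", np.argsort(fisher)[::-1][:5])
print("top features by mutual info :", np.argsort(mi)[::-1][:5])
```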

Feature selection: dependent features
Filtering with dependent features: let φ be the current set of features (starting from the complete set). We can remove feature φ_k(x) from it when
$P(y \mid \phi \setminus \phi_k) \approx P(y \mid \phi)$ for all values of φ_k and y.
Repeat the removals until the two probabilities start to differ too much.
Problem: how do we compute/estimate $P(y \mid \phi \setminus \phi_k)$ and $P(y \mid \phi)$? Solution: make some simplifying assumption about the underlying probabilistic model, for example use a Naïve Bayes model (a small sketch of this idea appears after the wrapper slide below).
Advantages: speed, modularity, applied before classification. Disadvantage: may not be as accurate.

Feature selection: wrappers
Wrapper approach: the feature selection is driven by the prediction accuracy of the classifier (or regressor) actually built.
How do we find the appropriate feature set? Idea: greedy search in the space of classifiers.
- Gradually add the features that improve the quality score the most, or
- gradually remove the features that affect the accuracy the least.
The score should reflect the accuracy of the classifier (its error) and also prevent overfitting. The standard way to measure the quality: internal cross-validation (m-fold cross-validation).
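Returning to the Naïve-Bayes-based backward filtering described at the start of this slide pair, here is one possible sketch: a feature is dropped when the estimated class posteriors barely change without it. The Gaussian Naïve Bayes model, the 0.05 threshold, and the synthetic data are assumptions made for illustration, not choices from the lecture.

```python
# Backward filtering with a simplifying Naive Bayes model: drop a feature if
# P(y | features) barely changes when the feature is removed.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

def posterior(X, y, cols):
    """P(y | selected features) estimated with a Gaussian Naive Bayes model."""
    return GaussianNB().fit(X[:, cols], y).predict_proba(X[:, cols])

selected = list(range(X.shape[1]))
changed = True
while changed and len(selected) > 1:
    changed = False
    base = posterior(X, y, selected)
    for k in list(selected):
        reduced = [j for j in selected if j != k]
        diff = np.abs(base - posterior(X, y, reduced)).max()
        if diff < 0.05:        # posteriors barely differ: feature k looks redundant
            selected = reduced
            changed = True
            break

print("features kept:", selected)
```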

Feature selection: wrappers
Example of a greedy (forward) search with a logistic regression model with features:
- Start with $p(y = 1 \mid x, w) = g(w_0)$.
- Choose the feature φ_i(x) with the best score: $p(y = 1 \mid x, w) = g(w_0 + w_i \phi_i(x))$.
- Choose the feature φ_j(x) with the best score: $p(y = 1 \mid x, w) = g(w_0 + w_i \phi_i(x) + w_j \phi_j(x))$.
- Etc. When do we stop?

Internal cross-validation
Goal: stop the learning when the generalization error (the performance on the population from which the data were drawn) is smallest.
A test set can be used to estimate the generalization error; it consists of data different from the training set. An internal validation set is a test set used to stop the learning process, e.g. the feature selection process.
Cross-validation (m-fold):
- Divide the data into m equal partitions (of size N/m).
- Hold out one partition for validation and train the classifier on the rest of the data.
- Repeat so that every partition is held out once.
- The estimate of the generalization error of the learner is the mean of the errors of all the classifiers.
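A minimal sketch of the wrapper approach, assuming scikit-learn: greedy forward selection for a logistic regression classifier, with each candidate feature set scored by internal 5-fold cross-validation, and the search stopped once adding a feature no longer improves the score (one reasonable stopping rule, not the only one).

```python
# Greedy forward feature selection wrapped around logistic regression,
# scored by internal m-fold (here 5-fold) cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Score every candidate feature added to the current set.
    scores = {k: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [k]], y, cv=5).mean()
              for k in remaining}
    k_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:       # stop when adding a feature no longer helps
        break
    selected.append(k_best)
    remaining.remove(k_best)
    best_score = s_best

print("selected features:", selected, "CV accuracy:", round(best_score, 3))
```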

Embedded methods
Feature selection and classification model learning are done together. Embedded models:
- Regularized models: models of higher complexity are explicitly penalized, leading to the virtual removal of inputs from the model (regularized logistic/linear regression).
- Support vector machines: the optimization of the margins penalizes nonzero weights.
- CART / decision trees.

Principal component analysis (PCA)
Objective: we want to replace a high-dimensional input with a small set of features (obtained by combining the inputs). This is different from feature subset selection!
PCA: a linear transformation of the d-dimensional input x into an M-dimensional feature vector z such that M < d, under which the retained variance is maximal. Equivalently, it is the linear projection for which the sum-of-squares reconstruction cost is minimized.

[Figures: PCA example. The data are projected onto the principal directions Xprim = 0.04x + 0.06y - 0.99z and Yprim = 0.70x + 0.70y + 0.07z, retaining 97% of the variance; only axis ticks of the scatter plots survived extraction.]

Principal component analysis (PCA)
PCA: a linear transformation of the d-dimensional input x into an M-dimensional feature vector z such that M < d, under which the retained variance is maximal. Task independent.
Fact: a vector x can be represented using a set of d orthonormal vectors u_i:
$x = \sum_{i=1}^{d} z_i u_i$
This leads to a transformation of coordinates (from x to z using the u_i's):
$z_i = u_i^T x$

PCA
Idea: replace the d coordinates with M of the z coordinates to represent x. We want to find the best subset of M basis vectors:
$\tilde{x} = \sum_{i=1}^{M} z_i u_i + \sum_{i=M+1}^{d} b_i u_i$
where the b_i are constants, fixed for all data points.
How do we choose the best set of basis vectors? We want the subset that gives the best approximation of the data in the dataset on average (we use a least-squares fit). The error for data entry x^n is
$x^n - \tilde{x}^n = \sum_{i=M+1}^{d} (z_i^n - b_i)\, u_i$
so the average reconstruction error over the dataset is
$E_M = \frac{1}{2N} \sum_{n=1}^{N} \| x^n - \tilde{x}^n \|^2 = \frac{1}{2N} \sum_{n=1}^{N} \sum_{i=M+1}^{d} (z_i^n - b_i)^2$

PCA
Differentiating the error function with respect to all b_i and setting the derivatives equal to 0, we get
$b_i = \frac{1}{N} \sum_{n=1}^{N} z_i^n = u_i^T \bar{x}$
Then we can rewrite the error as
$E_M = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T \Sigma u_i, \qquad \Sigma = \frac{1}{N} \sum_{n=1}^{N} (x^n - \bar{x})(x^n - \bar{x})^T$
The error function is optimized when the basis vectors satisfy
$\Sigma u_i = \lambda_i u_i, \qquad E_M = \frac{1}{2} \sum_{i=M+1}^{d} \lambda_i$
The best M basis vectors: discard the vectors with the d - M smallest eigenvalues (or, equivalently, keep the vectors with the M largest eigenvalues). Each such eigenvector u_i is called a principal component.

PCA
Once the eigenvectors with the largest eigenvalues are identified, they are used to transform the original d-dimensional data to M dimensions: $z = (u_1^T x, \ldots, u_M^T x)$.
To find the true dimensionality of the data we can just look at the eigenvalues that contribute the most (small eigenvalues are disregarded).
Problem: PCA is a linear method, so the true dimensionality can be overestimated; there can be non-linear correlations.
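The derivation above translates directly into a short NumPy sketch: form the sample covariance matrix, take its eigendecomposition, keep the M eigenvectors with the largest eigenvalues, and project. The synthetic data and M = 3 are illustrative choices.

```python
# PCA via eigendecomposition of the covariance matrix, as derived above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 10))    # 200 points, d = 10, low-rank structure
M = 3

x_bar = X.mean(axis=0)
Sigma = (X - x_bar).T @ (X - x_bar) / X.shape[0]            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)                     # ascending eigenvalues
order = np.argsort(eigvals)[::-1]                            # largest first
U = eigvecs[:, order[:M]]                                    # top-M principal components

Z = (X - x_bar) @ U                      # reduced M-dimensional representation
X_rec = Z @ U.T + x_bar                  # reconstruction from M components
E_M = 0.5 * eigvals[order[M:]].sum()     # error predicted by the discarded eigenvalues
emp = 0.5 * np.mean(((X - X_rec) ** 2).sum(axis=1))   # empirical reconstruction error
retained = eigvals[order[:M]].sum() / eigvals.sum()
print(f"retained variance: {retained:.2%}, E_M: {E_M:.4f}, empirical: {emp:.4f}")
```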

Dimensionality reduction with neural nets
PCA is limited to linear dimensionality reduction. To do non-linear reductions we can use neural nets.
Auto-associative network: a neural network with the same inputs and outputs (x_1, ..., x_d), with a middle layer z = (z_1, ..., z_M) that corresponds to the reduced dimensions.

Dimensionality reduction with neural nets
Error criterion:
$E = \frac{1}{2} \sum_{n=1}^{N} \sum_{i=1}^{d} \left( y_i(x^n) - x_i^n \right)^2$
The error measure tries to recover the original data through the limited number of dimensions in the middle layer. Non-linearities are modeled through intermediate layers between the middle layer and the input/output; if no intermediate layers are used, the model replicates PCA optimization through learning.
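A minimal sketch of such an auto-associative network, written in PyTorch (a library choice assumed here, not prescribed by the lecture); the intermediate tanh layers provide the non-linearity and the M-unit middle layer gives the reduced representation.

```python
# Auto-associative (autoencoder) network: same inputs and outputs, trained to
# minimize the reconstruction error E described above.
import torch
import torch.nn as nn

d, M = 10, 2                                   # input dimension and reduced dimension
X = torch.randn(500, 5) @ torch.randn(5, d)    # synthetic data with low-rank structure

encoder = nn.Sequential(nn.Linear(d, 16), nn.Tanh(), nn.Linear(16, M))
decoder = nn.Sequential(nn.Linear(M, 16), nn.Tanh(), nn.Linear(16, d))
model = nn.Sequential(encoder, decoder)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(200):
    optimizer.zero_grad()
    loss = ((model(X) - X) ** 2).sum(dim=1).mean()   # reconstruction error
    loss.backward()
    optimizer.step()

Z = encoder(X).detach()      # the M-dimensional middle-layer representation
print("reconstruction error:", round(loss.item(), 4), "Z shape:", tuple(Z.shape))
```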

Dimensionality reduction through clustering
Clustering algorithms group together similar instances in the data sample. Dimensionality reduction based on clustering: replace a high-dimensional data entry with a cluster label.
Problem: deterministic clustering gives only one label per input, which may not be enough to represent the data for prediction.
Solutions:
- clustering over subsets of the input data,
- soft clustering (the probability of a cluster is used directly).

Dimensionality reduction through clustering
Soft clustering (e.g. a mixture of Gaussians) attempts to cover all instances in the data sample with a small number of groups. Each group is more or less responsible for a data entry (the responsibility is the posterior of the group given the data entry). For a mixture of Gaussians, the responsibility of group l is
$h_l = \frac{\pi_l \, p(x \mid u_l = 1)}{\sum_{l'} \pi_{l'} \, p(x \mid u_{l'} = 1)}$
Dimensionality reduction based on soft clustering: replace the high-dimensional data x with the set of group posteriors, and feed all the posteriors to the learner, e.g. a linear regressor or classifier.
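A minimal sketch of soft-clustering-based reduction, assuming scikit-learn: a Gaussian mixture is fit to the inputs, its responsibilities (group posteriors) replace the original features, and a classifier is trained on them. The number of groups and the diagonal covariances are illustrative choices.

```python
# Replace high-dimensional inputs with mixture-of-Gaussians responsibilities
# and feed them to a downstream classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(X)
H = gmm.predict_proba(X)                 # responsibilities h_l: 500 x 8 instead of 500 x 50

clf = LogisticRegression(max_iter=1000).fit(H, y)
print("accuracy on responsibility features:", round(clf.score(H, y), 3))
```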

Dimensionality reduction through clustering
We can use the idea of soft clustering before applying regression/classification learning; this gives two-stage algorithms:
- learn the clustering,
- learn the classification.
Input of the clustering: x (high dimensional). Output of the clustering (input of the classifier): p(c = i | x). Output of the classifier: y.
Example: networks with Radial Basis Functions (RBFs).
Problem: the clustering learns based on p(x) and disregards the target, while the prediction is based on p(y | x).

Networks with radial basis functions
An alternative to multilayer neural networks for modeling non-linearities.
Radial basis functions: $f(x) = w_0 + \sum_{j=1}^{k} w_j \phi_j(x)$
- Based on interpolations of prototype points (the means); each basis function is affected by the distance between x and its mean.
- Fit the outputs of the basis functions through the linear model.
Choice of basis functions, e.g. Gaussian: $\phi_j(x) = \exp\left( -\frac{\| x - \mu_j \|^2}{2 \sigma_j^2} \right)$
Learning: in practice RBF networks seem to work OK for up to about 10 dimensions; for higher dimensions, ridge functions (e.g. logistic units) and combinations of multiple learners seem to do a better job.
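A minimal sketch of the two-stage RBF idea, assuming NumPy and scikit-learn: k-means supplies the prototype means (the clustering stage), Gaussian basis functions are evaluated at those means, and the output weights are fit by least squares (the linear stage). The basis width and the number of prototypes are illustrative choices.

```python
# Two-stage RBF network: k-means prototypes + Gaussian basis functions
# + linear least-squares fit of the output weights.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * np.cos(X[:, 1]) + 0.1 * rng.normal(size=200)

k = 10
mu = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_
sigma = 1.0

def design_matrix(X):
    """Phi[n, j] = exp(-||x_n - mu_j||^2 / (2 sigma^2)), plus a bias column."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), Phi])     # implements w_0 + sum_j w_j phi_j(x)

Phi = design_matrix(X)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)            # fit the linear output layer
print("training RMSE:", np.sqrt(np.mean((Phi @ w - y) ** 2)).round(3))
```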