Structured learning for feature selection and prediction

Yoonkyung Lee
Department of Statistics, The Ohio State University
January 22-26, 2007, Winter School at CIMAT

Overview
- Part I: Introduction to kernel methods
- Part II: Learning with reproducing kernel Hilbert spaces
- Part III: Structured learning for feature selection and prediction

Outline
- Motivation
- Feature selection procedures
- Generalization of LASSO for kernel methods
- Structured MSVM with ANOVA decomposition
- Application
- Concluding remarks

Motivation for feature selection
- A key question in many scientific investigations.
- Achieve parsimony (Occam's razor): "Entities should not be multiplied beyond necessity."
- Enhance interpretation.
- Often reduce variance, and hence improve prediction accuracy.

Feature selection procedures
- Combinatorial approach: best subset selection, forward selection, backward elimination, stepwise regression; e.g. Guyon et al. (2002), recursive feature selection.
- $\ell_1$ penalty for simultaneous fitting and selection: e.g. Bradley and Mangasarian (1998), linear SVM with $\ell_1$ penalty.

LASSO
- Tibshirani (1996), LASSO (Least Absolute Shrinkage and Selection Operator):

$$\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \|\beta\|_1$$

LASSO coefficient paths
Figure: LASSO coefficient paths, with standardized coefficients plotted against $|\beta| / \max|\beta|$.

Generalization of LASSO
- Kernel methods may be difficult to interpret when the embedding into feature space is implicit.
- Regression: Lin and Zhang (2003), COmponent Selection and Smoothing Operator; Gunn and Kandola (2002), structural modelling with sparse kernels.
- Classification: Zhang (2006) for the binary SVM; Lee et al. (2006) for the multiclass SVM.
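The LASSO criterion of Tibshirani (1996) can be minimized one coefficient at a time. The following is a minimal sketch, assuming cyclic coordinate descent with soft-thresholding; this solver is a common choice and is not specified in the slides, and the names `soft_threshold` and `lasso_cd` are illustrative only.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t and set it to zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - x_i' beta)^2 + lam * ||beta||_1.

    Assumed solver: cyclic coordinate descent (not taken from the slides).
    Columns of X are assumed to be nonzero.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with the j-th term put back in
            r_j = y - X @ beta + X[:, j] * beta[j]
            # exact one-dimensional minimizer of the criterion in beta_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta
```

With `lam = 0` the iteration approaches the least squares fit, and once `lam` exceeds `2 * max_j |x_j' y|` every coefficient is set to zero, which is the simultaneous fitting and selection behaviour the slide refers to.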

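Strategy for feature selection
- Structured representation of $f$ using functional ANOVA decomposition
- A sparse solution approach with $\ell_1$ penalty
- A unified treatment for regression and classification (both linear and nonlinear cases)
- Inexpensive additional computation
- Systematic elaboration of $f$ with features

Functional ANOVA decomposition
For $f$ defined on a product domain $\mathcal{X} = \prod_{j=1}^{p} \mathcal{X}_j$,

$$f = \prod_{j} [A_j + (I - A_j)] f = \Big(\prod_{j} A_j\Big) f + \sum_{i} \Big(\prod_{j \neq i} A_j\Big)(I - A_i) f + \sum_{i<j} \Big(\prod_{r \neq i,j} A_r\Big)(I - A_i)(I - A_j) f + \cdots$$

- Functional overall mean + main effects + two-way interactions + $\cdots$.
- Define $A_j$ appropriately so that the decomposition of $A_j$ and $I - A_j$ corresponds to $\{1\} \oplus \bar{H}_j$.

ANOVA spaces and kernels
Wahba (1990), smoothing spline ANOVA models
- Function: $f(x) = b + \sum_{\alpha=1}^{p} f_\alpha(x_\alpha) + \sum_{\alpha<\beta} f_{\alpha\beta}(x_\alpha, x_\beta) + \cdots$
- Functional space: $f \in \mathcal{H} = \bigotimes_{\alpha=1}^{p} (\{1\} \oplus \bar{H}_\alpha)$, with $\mathcal{H} = \{1\} \oplus \bigoplus_{\alpha=1}^{p} \bar{H}_\alpha \oplus \bigoplus_{\alpha<\beta} (\bar{H}_\alpha \otimes \bar{H}_\beta) \oplus \cdots$
- Reproducing kernel (r.k.): $K(x, x') = 1 + \sum_{\alpha=1}^{p} K_\alpha(x, x') + \sum_{\alpha<\beta} K_{\alpha\beta}(x, x') + \cdots$
- Modification of the r.k. by rescaling parameters $\theta \ge 0$: $K_\theta(x, x') = 1 + \sum_{\alpha=1}^{p} \theta_\alpha K_\alpha(x, x') + \sum_{\alpha<\beta} \theta_{\alpha\beta} K_{\alpha\beta}(x, x') + \cdots$

$\ell_1$ penalty on $\theta$
Truncating $\mathcal{H}$ to $\mathcal{F} = \{1\} \oplus \bigoplus_{\nu=1}^{d} \mathcal{F}_\nu$, find $f(x) \in \mathcal{F}$ minimizing

$$\frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \sum_{\nu} \theta_\nu^{-1} \|P^\nu f\|^2.$$

Then $\hat{f}(x) = \hat{b} + \sum_{i=1}^{n} \hat{c}_i \Big[ \sum_{\nu=1}^{d} \theta_\nu K_\nu(x_i, x) \Big]$.

For sparsity, minimize

$$\frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \sum_{\nu} \theta_\nu^{-1} \|P^\nu f\|^2 + \lambda_\theta \sum_{\nu} \theta_\nu$$

subject to $\theta_\nu \ge 0$ for all $\nu$.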
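The rescaled kernel $K_\theta$ makes feature selection concrete: a component whose $\theta$ is zero drops out of every kernel evaluation. Below is a minimal sketch restricted to main effects. It assumes, purely for illustration, that each main-effect kernel is the product kernel on a single centered coordinate; the slides do not fix a particular $K_\alpha$, and the names `main_effect_grams` and `k_theta` are hypothetical.

```python
import numpy as np

def main_effect_grams(X, Z):
    """One Gram matrix per coordinate, K_alpha[i, j] = X[i, alpha] * Z[j, alpha].

    Assumption: the product kernel on one (centered) coordinate stands in
    for the main-effect kernel K_alpha of the slides.
    """
    return [np.outer(X[:, a], Z[:, a]) for a in range(X.shape[1])]

def k_theta(X, Z, theta):
    """K_theta(x, x') = 1 + sum_alpha theta_alpha * K_alpha(x, x'), main effects only."""
    theta = np.asarray(theta, dtype=float)
    if np.any(theta < 0):
        raise ValueError("rescaling parameters theta must be nonnegative")
    K = np.ones((X.shape[0], Z.shape[0]))  # the constant term for {1}
    for th, K_a in zip(theta, main_effect_grams(X, Z)):
        K += th * K_a
    return K
```

With all $\theta_\alpha = 1$ this choice reduces to $1 + x^\top x'$, and with $\theta_2 = 0$ the Gram matrix no longer depends on the second coordinate at all.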

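Related to kernel learning
Micchelli and Pontil (2005), Learning the kernel function via regularization
- $\mathcal{K} = \{K_\nu, \nu \in \mathcal{N}\}$: a compact and convex set of kernels
- A variational problem for optimal kernel configuration:

$$\min_{K \in \mathcal{K}} \min_{f \in \mathcal{H}_K} \Big( \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda J(f) \Big)$$

One-step update for structured regression
Given $\hat{b}$ and $\{\hat{c}_j\}$, recalibrate $\theta$ to minimize

$$\frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \hat{b} - \sum_{\nu=1}^{d} \theta_\nu \Big[ \sum_{j=1}^{n} \hat{c}_j K_\nu(x_j, x_i) \Big] \Big)^2 + \lambda \sum_{\nu} \theta_\nu \sum_{i,j=1}^{n} \hat{c}_i \hat{c}_j K_\nu(x_i, x_j)$$

subject to $\theta_\nu \ge 0$ for all $\nu$, and $\sum_{\nu} \theta_\nu \le s$.

Nonnegative Garrote
Breiman, L. (1995), Better Subset Regression Using the Nonnegative Garrote
- Starting with the full LSE, it both shrinks and zeroes coefficients.
- Given $\hat{\beta}^{LS}$, take $(c_1, \ldots, c_p)$ to minimize

$$\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} c_j \hat{\beta}_j^{LS} x_{ij} \Big)^2$$

subject to $c_j \ge 0$ and $\sum_{j=1}^{p} c_j \le s$.
- Generally lower prediction error than best subset selection.

SVM when k > 2
Lee, Lin & Wahba, JASA (2004)
- $y = (y^1, \ldots, y^k)$: class code with $y^j = 1$ and $-1/(k-1)$ elsewhere, if $y = j$.
- Find $f = (f^1, \ldots, f^k) = (b^1 + h^1(x), \ldots, b^k + h^k(x))$ with $h^j \in \mathcal{H}_K$ and the sum-to-zero constraint minimizing

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{j \neq y_i} \big( f^j(x_i) - y_i^j \big)_+ + \frac{\lambda}{2} \sum_{j=1}^{k} \|h^j\|^2.$$

- Classification rule: $\phi(x) = \arg\max_j [f^j(x)]$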
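Breiman's nonnegative garrote can be sketched in a few lines. The version below is a minimal sketch under one stated assumption: instead of the constraint $\sum_j c_j \le s$, it solves the Lagrangian form $\sum_i (y_i - \sum_j c_j \hat{\beta}_j^{LS} x_{ij})^2 + \lambda \sum_j c_j$ with $c_j \ge 0$, where a larger $\lambda$ plays the role of a smaller $s$. The coordinate-wise solver and the name `nonneg_garrote` are illustrative choices, not taken from the slides.

```python
import numpy as np

def nonneg_garrote(X, y, beta_ls, lam, n_sweeps=200):
    """Shrinkage factors c >= 0 for a given least squares estimate beta_ls.

    Assumption: penalized (Lagrangian) form with lam * sum(c) in place of
    the constraint sum(c) <= s; solved by cyclic coordinate descent.
    """
    Z = X * beta_ls                      # column j is beta_ls[j] * x_j
    c = np.ones(X.shape[1])              # c = 1 reproduces the full LSE
    col_sq = (Z ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(Z.shape[1]):
            if col_sq[j] == 0.0:         # beta_ls[j] == 0: nothing to rescale
                c[j] = 0.0
                continue
            r_j = y - Z @ c + Z[:, j] * c[j]
            # one-dimensional minimizer, clipped at zero
            c[j] = max(0.0, (Z[:, j] @ r_j - lam / 2.0) / col_sq[j])
    return c
```

For an orthonormal design the solution has the closed form $c_j = (1 - \lambda / (2 \hat{\beta}_j^2))_+$, so small least squares coefficients are zeroed while large ones are shrunk only slightly. The one-step update of $\theta$ for structured regression has the same form, with the columns $\sum_j \hat{c}_j K_\nu(x_j, x_i)$ in place of $\hat{\beta}_j^{LS} x_{ij}$.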

Structured MSVM with ANOVA decomposition
Lee et al., Biometrika (2006)
Find $f = (f^1, \ldots, f^k) = (b^1 + h^1(x), \ldots, b^k + h^k(x))$ with the sum-to-zero constraint minimizing

$$\frac{1}{n} \sum_{i=1}^{n} L(y_i) \cdot (f(x_i) - y_i)_+ + \frac{\lambda}{2} \sum_{j=1}^{k} \Big( \sum_{\nu=1}^{d} \theta_\nu^{-1} \|P^\nu h^j\|^2 \Big) + \lambda_\theta \sum_{\nu=1}^{d} \theta_\nu$$

subject to $\theta_\nu \ge 0$ for $\nu = 1, \ldots, d$.
- $L(y)$: misclassification cost
- By the representer theorem, $\hat{f}^j(x) = \hat{b}^j + \sum_{i=1}^{n} \hat{c}_i^j \Big[ \sum_{\nu=1}^{d} \theta_\nu K_\nu(x_i, x) \Big]$.

Figure: MSVM component loss $(f^j - y^j)_+$ as a function of $f^j$, where $y^j = -1/(k-1)$.

Updating Algorithm
Letting $C = (\{b^j\}, \{c_i^j\})$ and denoting the objective function by $\Phi(\theta, C)$:
- Initialize $\theta^{(0)} = (1, \ldots, 1)^t$ and $C^{(0)} = \arg\min \Phi(\theta^{(0)}, C)$.
- At the $m$-th iteration ($m = 1, 2, \ldots$):
  - ($\theta$-step) find $\theta^{(m)}$ minimizing $\Phi(\theta, C^{(m-1)})$ with $C$ fixed.
  - (c-step) find $C^{(m)}$ minimizing $\Phi(\theta^{(m)}, C)$ with $\theta$ fixed.
- A one-step update can be used in practice.

Two-way regularization
- c-step solutions range from the simplest majority rule to the complete overfit to data as $\lambda$ decreases.
- $\theta$-step solutions range from the constant model to the full model with all the variables as $\lambda_\theta$ decreases.
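The alternation between the c-step and the $\theta$-step can be sketched compactly if the MSVM hinge loss is swapped for squared error loss, which is the structured regression setting rather than the classifier itself. This substitution is an assumption made only to keep the example short; the objective used here is $\Phi(\theta, b, c) = \frac{1}{n}\|y - b - K_\theta c\|^2 + \lambda c^\top K_\theta c + \lambda_\theta \sum_\nu \theta_\nu$ with $K_\theta = \sum_\nu \theta_\nu K_\nu$, and all function names are hypothetical.

```python
import numpy as np

def k_theta(grams, theta):
    """Weighted sum of the component Gram matrices."""
    return sum(t * K for t, K in zip(theta, grams))

def objective(grams, y, theta, b, c, lam, lam_theta):
    K = k_theta(grams, theta)
    resid = y - b - K @ c
    return resid @ resid / len(y) + lam * (c @ K @ c) + lam_theta * theta.sum()

def c_step(grams, y, theta, lam):
    """Exact minimizer over (b, c) for fixed theta: a bordered linear system."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = k_theta(grams, theta) + n * lam * np.eye(n)
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    sol = np.linalg.solve(A, np.append(y, 0.0))
    return sol[n], sol[:n]

def theta_step(grams, y, theta, b, c, lam, lam_theta, n_sweeps=50):
    """Coordinate descent over theta >= 0 for fixed (b, c)."""
    n = len(y)
    G = np.column_stack([K @ c for K in grams])
    w = np.array([lam * (c @ K @ c) + lam_theta for K in grams])
    theta = theta.copy()
    for _ in range(n_sweeps):
        for v in range(len(grams)):
            gg = G[:, v] @ G[:, v]
            if gg == 0.0:
                theta[v] = 0.0
                continue
            r_v = (y - b) - G @ theta + G[:, v] * theta[v]
            theta[v] = max(0.0, (G[:, v] @ r_v - n * w[v] / 2.0) / gg)
    return theta

def fit(grams, y, lam, lam_theta, n_iter=1):
    """Initialize theta = (1, ..., 1), then alternate theta-step and c-step."""
    theta = np.ones(len(grams))
    b, c = c_step(grams, y, theta, lam)
    history = [objective(grams, y, theta, b, c, lam, lam_theta)]
    for _ in range(n_iter):
        theta = theta_step(grams, y, theta, b, c, lam, lam_theta)
        b, c = c_step(grams, y, theta, lam)
        history.append(objective(grams, y, theta, b, c, lam, lam_theta))
    return theta, b, c, history
```

The default `n_iter=1` corresponds to the one-step update. Because each step minimizes the same objective with the other block held fixed, the recorded objective values cannot increase from one iteration to the next, and a very large `lam_theta` drives every `theta` to zero, leaving the constant model.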

Small Round Blue Cell Tumors of Childhood
- Khan et al. (2001) in Nature Medicine
- Tumor types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL) and the Ewing family of tumors (EWS).
- Number of genes: 2308
- Class distribution of the data set: table with columns EWS, BL (NHL), NB, RMS and total, and rows training set, test set and total.

A synthetic miniature data set
- It consists of 100 genes from Khan et al. (63 training and 20 test cases).
- Use the F-ratio for each gene based on the training cases only.
- The top 20 genes serve as variables truly associated with the class.
- The bottom 80 genes, with the class label randomly jumbled, serve as irrelevant variables.
- 100 replicates are generated by bootstrapping samples from this miniature data set, keeping the class proportions the same as in the original data.

The proportion of gene inclusion (%)
Figure: The proportion of inclusion (%) of each gene in the final classifiers over 100 runs, plotted against variable id. The dotted line delimits informative variables from noninformative ones. 10-fold CV was used for tuning.

The original data with 2308 genes
Figure: The proportion of selection of each gene in one-step updated SMSVMs for 100 bootstrap samples, plotted against gene rank. Genes are presented in the order of marginal rank in the original sample.
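The screening and resampling steps described for the miniature data set can be sketched as follows. This is a minimal illustration, assuming the F-ratio is the usual one-way ANOVA F statistic computed gene by gene and that the bootstrap resamples within each class to keep class proportions fixed; the function names are hypothetical and no data from Khan et al. are reproduced here.

```python
import numpy as np

def f_ratio(X, labels):
    """One-way ANOVA F statistic for each column of X (rows are cases)."""
    classes = np.unique(labels)
    n, k = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for cls in classes:
        X_c = X[labels == cls]
        mean_c = X_c.mean(axis=0)
        ss_between += len(X_c) * (mean_c - grand_mean) ** 2
        ss_within += ((X_c - mean_c) ** 2).sum(axis=0)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def stratified_bootstrap(labels, rng):
    """Row indices resampled with replacement within each class."""
    parts = []
    for cls in np.unique(labels):
        members = np.flatnonzero(labels == cls)
        parts.append(rng.choice(members, size=len(members), replace=True))
    return np.concatenate(parts)
```

Genes would then be ranked by `np.argsort(-f_ratio(X_train, y_train))`, and the inclusion proportion of a gene is the fraction of bootstrap replicates in which its fitted $\theta$ is nonzero.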

Summary of the full data analysis
Figure: The number of genes selected less often than or as frequently as a given proportion in 100 runs (cumulative number of genes against proportion of inclusion).
- The empirical distribution of the number of genes included in one-step updates contained the middle 50% of values between 22 and 228 with median [...].
- [...] genes were consistently selected more than 95% of the time.
- About 2000 genes were selected less than 20% of the time.
- Gene selection led to a reduction in test error rates by [...] on average (from [...] to [...]) with standard error of [...].
- It also reduced the variance of test error rates.

Concluding remarks
- Integrate feature selection with kernel methods using an $\ell_1$ type penalty.
- Enhance interpretation without compromising prediction accuracy.
- General approach for structured and sparse representation with kernels.
- RKHS methods can solve a wide range of statistical learning problems in a principled way.
