Advanced Machine Learning & Perception
Instructor: Tony Jebara
SVM Feature & Kernel Selection
- SVM Extensions
- Feature Selection (Filtering and Wrapping)
- SVM Feature Selection
- SVM Kernel Selection
SVM Extensions
[Roadmap table covering: Classification, Feature/Kernel Selection, Regression, Meta/Multi-Task Learning, Transduction/Semi-supervised, Multi-Class/Structured; the per-cell status marks are not recoverable from the extraction.]
Feature Selection & Sparsity
- Isolates interesting dimensions of the data for a given task
- Reduces complexity of the data
- Augments sparse vectors (SVMs) with sparse dimensions
- Can also improve generalization
Example: find the subset of d features out of D dimensions that gives the largest-margin SVM:

  f(x) = \theta^T (s .* x) + b,   s(i) \in \{0,1\},   \sum_{i=1}^D s(i) = d \le D

Typically this needs an exponential search: 1000-choose-10 SVMs if we consider all possible subsets of dimensions. How to do this efficiently (and jointly) with SVM estimation? Two classical approaches: Filtering & Wrapping.
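A quick sanity check on the combinatorics above (a minimal sketch; the 1000-choose-10 count is the one quoted on the slide):

```python
from math import comb

# One SVM per candidate feature subset: picking 10 features out of
# 1000 already requires ~2.6e23 trainings, so exhaustive search is out.
n_subsets = comb(1000, 10)
print(f"{n_subsets:.3e}")
```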
Feature Selection: Filtering
- Filtering: find/eliminate some features before even training your classifier (before induction), as a pre-processing step.
- Wrapping: find/eliminate some features by evaluating their accuracy after you train your classifier (after induction).
Fisher Information Criterion: compute the score below for each feature i = 1..D and keep the top d features:

  Fisher(i) = (\mu_i^+ - \mu_i^-)^2 / ((\sigma_i^+)^2 + (\sigma_i^-)^2)

  where \mu_i^+ = \frac{1}{T^+} \sum_{t \in +} x_t(i),   (\sigma_i^+)^2 = \frac{1}{T^+} \sum_{t \in +} (x_t(i) - \mu_i^+)^2   (and similarly for the negative class)

Like putting a Gaussian on each class in each single dimension to compute their spread. The Gaussian assumption may be wrong! Only measures how linearly separable the data is.
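The Fisher criterion is easy to vectorize. Below is a minimal sketch (the function name `fisher_score` and the toy data are ours, not the slide's): one feature separates the classes, one is pure noise, and the score ranks them accordingly.

```python
import numpy as np

def fisher_score(X, y):
    """Fisher(i) = (mu_i^+ - mu_i^-)^2 / (var_i^+ + var_i^-), per feature.
    X: (T, D) data matrix; y: labels in {-1, +1}."""
    Xp, Xn = X[y == 1], X[y == -1]
    mu_p, mu_n = Xp.mean(axis=0), Xn.mean(axis=0)
    var_p, var_n = Xp.var(axis=0), Xn.var(axis=0)
    return (mu_p - mu_n) ** 2 / (var_p + var_n)

# Toy data: feature 0 separates the classes (means +/-2), feature 1 is noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([+2.0, 0.0], 1.0, (50, 2)),
               rng.normal([-2.0, 0.0], 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
scores = fisher_score(X, y)
print(scores)  # feature 0 scores far higher than feature 1
```

Keeping the top-d features is then just an argsort over `scores`.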
Feature Selection: Filtering
Pearson Correlation Coefficients: score how similar or redundant two features are. Can then remove redundancies, or remove features that are too correlated on average (again Gaussian only):

  Pearson(i,j) = \frac{\sum_t (x_t(i) - \mu_i)(x_t(j) - \mu_j)}{(T-1)\, \sigma_i \sigma_j}

Kolmogorov-Smirnov Test: non-parametric, more general than the Gaussian assumption, but only handles 1 feature at a time. For each feature, compute the empirical cumulative distribution function over both classes and then over a single class. Find the KS score as follows and keep the top d features:

  KS(i) = \sqrt{T}\, \sup_q \left| \hat{P}\{x(i) \le q\} - \hat{P}\{x(i) \le q \mid y = 1\} \right|
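The KS score per feature can be sketched with empirical CDFs evaluated at the pooled sample points (the helper name `ks_score` and the toy data are ours; this is one reasonable reading of the slide's formula):

```python
import numpy as np

def ks_score(x, y):
    """KS = sqrt(T) * sup_q |F_hat(q) - F_hat(q | y=1)| for one feature,
    taking the sup over the pooled sample points (empirical CDFs)."""
    T = len(x)
    q = np.sort(x)
    F_all = np.searchsorted(q, q, side="right") / T           # CDF over both classes
    xp = np.sort(x[y == 1])
    F_pos = np.searchsorted(xp, q, side="right") / len(xp)    # CDF over the positive class
    return np.sqrt(T) * np.max(np.abs(F_all - F_pos))

# Toy data: positive class is shifted on the informative feature only.
rng = np.random.default_rng(0)
y = np.array([1] * 50 + [-1] * 50)
x_info = np.concatenate([rng.normal(2, 1, 50), rng.normal(-2, 1, 50)])
x_noise = rng.normal(0, 1, 100)
print(ks_score(x_info, y), ks_score(x_noise, y))
```

The informative feature gets a much larger gap between the pooled CDF and the positive-class CDF than the noise feature does.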
Feature Selection: Filtering
Kolmogorov-Smirnov example: [figure: empirical CDFs \hat{P}(x \le q) over all the data \{x_1, ..., x_T\} and \hat{P}(x \le q \mid y=1) over the positive class \{x_t : y_t = 1\}; the KS score is the largest vertical gap between the two curves]

  KS = \sqrt{T}\, \sup_q \left| \hat{P}\{x \le q\} - \hat{P}\{x \le q \mid y = 1\} \right|
Feature Selection: Wrapping
Wrapping: use the accuracy of the resulting classifier to drive the feature selection:

  f(x) = w^T \phi(s .* x) + b

Here s .* x is the elementwise product of x with the binary vector s.
Note: more features usually improves training accuracy. So either pre-specify the maximum number (or %) of features, or optimize a generalization bound (SRM vs. ERM).
Margin & Radius Bound (like a VC bound):

  E\{P_{err}\} \le \frac{1}{T} E\left\{ \frac{R^2}{M^2} \right\} = \frac{1}{T} E\{ R^2 W^2(\alpha) \}

Better, the Span Bound (if the support vectors don't change when doing leave-one-out cross-validation, i.e. removing point p):

  E\{P_{err}\} \le \frac{1}{T} E\left\{ \sum_{p=1}^T u\!\left( \frac{\alpha_p}{(K_{SV}^{-1})_{pp}} - 1 \right) \right\}

Expectations are over datasets; u(\cdot) is the step function; K_{SV} is the Gram matrix over only the support vectors.
SVM Feature Selection
Margin & Radius Bound: optimize via gradient descent. Assume the selection vector s is given, so the switched kernel is k(x_t, x_{t'}) = k(s .* x_t, s .* x_{t'}).
Compute R^2 and the betas via:

  R^2 = \max_\beta \sum_t \beta_t k(x_t, x_t) - \sum_{t,t'} \beta_t \beta_{t'} k(x_t, x_{t'})   s.t.  \sum_t \beta_t = 1,  \beta_t \ge 0

Compute W^2 and the alphas via:

  W^2 = \max_\alpha 2 \sum_t \alpha_t - \sum_{t,t'} \alpha_t \alpha_{t'} y_t y_{t'} k(x_t, x_{t'})   s.t.  \alpha_t \in [0, C],  \sum_t \alpha_t y_t = 0

Assume the switches are continuous and take derivatives of R^2 W^2:

  \frac{\partial (R^2 W^2)}{\partial s_i} = R^2 \frac{\partial W^2}{\partial s_i} + W^2 \frac{\partial R^2}{\partial s_i}

  \frac{\partial R^2}{\partial s_i} = \sum_t \beta_t \frac{\partial k(x_t, x_t)}{\partial s_i} - \sum_{t,t'} \beta_t \beta_{t'} \frac{\partial k(x_t, x_{t'})}{\partial s_i}

  \frac{\partial W^2}{\partial s_i} = -\sum_{t,t'} \alpha_t \alpha_{t'} y_t y_{t'} \frac{\partial k(x_t, x_{t'})}{\partial s_i}
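The R^2 program above is a QP over the simplex. As a minimal sketch, it can be solved with Frank-Wolfe iterations (the solver choice is ours, not the slide's; the slide only states the QP). For two points under a linear kernel, R^2 should be the squared half-distance between them:

```python
import numpy as np

def radius_sq(K, iters=2000):
    """R^2 = max_beta  sum_t beta_t K_tt - beta^T K beta,  beta on the simplex.
    Solved by Frank-Wolfe: move toward the simplex vertex maximizing the gradient."""
    T = K.shape[0]
    beta = np.full(T, 1.0 / T)
    d = np.diag(K)
    for k in range(iters):
        g = d - 2 * K @ beta        # gradient of the (concave) objective
        j = np.argmax(g)            # best vertex e_j of the simplex
        step = 2.0 / (k + 2)
        beta = (1 - step) * beta + step * np.eye(T)[j]
    return beta @ d - beta @ K @ beta

# Two points at distance 2: the enclosing ball has radius 1, so R^2 = 1.
X = np.array([[0.0, 0.0], [2.0, 0.0]])
K = X @ X.T                         # linear kernel Gram matrix
r2 = radius_sq(K)
print(r2)                           # ~1.0
```

The W^2 program is the standard SVM dual and would be solved with any SVM trainer; only R^2 needs this extra QP.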
SVM Feature Selection
Use the chain rule to get the gradient of the kernel with respect to s. E.g. for the RBF kernel on switched inputs:

  k(x, x') = \exp\left( -\frac{1}{2\sigma^2} \| s .* x - s .* x' \|^2 \right) = \exp\left( -\frac{1}{2\sigma^2} \sum_{j=1}^D s_j^2 (x(j) - x'(j))^2 \right)

  \frac{\partial k(x, x')}{\partial s_j} = -\frac{s_j}{\sigma^2} (x(j) - x'(j))^2 \, k(x, x')
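The chain-rule gradient above can be checked against finite differences (a minimal sketch; function names and test inputs are ours):

```python
import numpy as np

def rbf_s(x, xp, s, sigma=1.0):
    """Switched RBF kernel: k = exp(-||s*x - s*xp||^2 / (2 sigma^2))."""
    d = s * (x - xp)
    return np.exp(-0.5 * np.dot(d, d) / sigma**2)

def grad_s(x, xp, s, sigma=1.0):
    """dk/ds_j = -(s_j / sigma^2) * (x_j - xp_j)^2 * k  (chain rule)."""
    return -(s / sigma**2) * (x - xp) ** 2 * rbf_s(x, xp, s, sigma)

# Central finite differences agree with the analytic gradient.
rng = np.random.default_rng(1)
x, xp = rng.normal(size=3), rng.normal(size=3)
s = np.array([0.5, 1.0, 0.2])
eps = 1e-6
num = np.array([(rbf_s(x, xp, s + eps * np.eye(3)[j]) -
                 rbf_s(x, xp, s - eps * np.eye(3)[j])) / (2 * eps)
                for j in range(3)])
ana = grad_s(x, xp, s)
print(np.allclose(num, ana, atol=1e-5))
```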
SVM Feature Selection
Assemble all the calculations to get the gradient vector over s. Given the old value s = [0, 1, 1, 0]^T, the gradient is:

  \frac{\partial (R^2 W^2)}{\partial s} = R^2 \frac{\partial W^2}{\partial s} + W^2 \frac{\partial R^2}{\partial s} = 92.4 \times [0.4, 0.2, 3.2, 2.4]^T + 25.4 \times [0.3, 3.1, 3.5, 2.3]^T

Take a small step against the gradient to drive down the R^2 W^2 term.
SVM Feature Selection
Synthesized mixture-of-Gaussians data: feature selection improves the classifier & speeds it up. [figure: results on synthetic data]
SVM Feature Selection
Real face & pedestrian (wavelet) data: only a speedup here. [figure: wavelet basis and results]
SVM Kernel Selection
We are given D base kernels, i = 1..D, to use in an SVM: k_1(x,x'), k_2(x,x'), ..., k_D(x,x'). How do we pick the best ones, or a combination of them? E.g.

  k_{FINAL}(x,x') = k_4(x,x') + k_9(x,x') + k_{12}(x,x')

If we only had to use 1 kernel, we could try D different SVMs. But to pick 5 out of 10 kernels we need 10-choose-5 = 252 SVMs! Even worse is picking a weighted combination of kernels where the weights are positive:

  k_{FINAL}(x,x') = \sum_{i=1}^D \alpha_i k_i(x,x'),   \alpha_i \ge 0

Define the alignment between two kernel matrices as:

  A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle}{\sqrt{\langle K_1, K_1 \rangle \langle K_2, K_2 \rangle}}   where   \langle K_1, K_2 \rangle = \sum_{i,j=1}^N k_1(x_i, x_j)\, k_2(x_i, x_j)
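Alignment is just a normalized Frobenius inner product, as in this minimal sketch (the toy label vector is ours): a Gram matrix equal to yy^T aligns perfectly with the labels, while the identity kernel aligns only partially.

```python
import numpy as np

def alignment(K1, K2):
    """A(K1,K2) = <K1,K2> / sqrt(<K1,K1><K2,K2>), Frobenius inner product."""
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

y = np.array([1.0, 1.0, -1.0, -1.0])
Y = np.outer(y, y)                  # label matrix yy^T
K_good = Y.copy()                   # a kernel built from the labels themselves
K_rand = np.eye(4)                  # an uninformative (identity) kernel
print(alignment(K_good, Y))         # 1.0
print(alignment(K_rand, Y))         # 0.5
```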
SVM Kernel Selection
We want a kernel matrix K that aligns with the label matrix:

  \max_K A(K, y y^T)

This can be written equivalently as the solution below:

  \max_K \langle K, y y^T \rangle   s.t.  \langle K, K \rangle = 1,  K \succeq 0

This can all be written as a semidefinite program (SDP):

  \max_{A,K} \langle K, y y^T \rangle   s.t.  \begin{bmatrix} A & K^T \\ K & I \end{bmatrix} \succeq 0,   1 - \mathrm{tr}(A) \ge 0,   K \succeq 0

Unfortunately, this can give a trivial solution.
SVM Kernel Selection
Instead, force K to be a conic combination of the base kernels: keep the same program as before PLUS the constraint

  K = \sum_{i=1}^D \alpha_i K_i,   \alpha_i \ge 0

This is simpler than an SDP: just a second-order cone program (faster code).
Feature vs. Kernel Selection
Linear feature selection can be done via kernel selection! Take f(x) = w^T (s .* x) + b where only a few s values are 1 and most are zero. Define the base kernels k_1(x,x'), ..., k_D(x,x') to be:

  k_i(x,x') = x(i)\, x'(i)   so that   K_{FINAL} = \sum_{i=1}^D s_i K_i

For example, in a linear SVM the classifier is:

  f(x) = \sum_t \alpha_t y_t k_{FINAL}(x_t, x) + b = \sum_t \alpha_t y_t \sum_{i=1}^D s_i k_i(x_t, x) + b = \sum_t \alpha_t y_t \sum_{i=1}^D s_i x_t(i) x(i) + b = w^T (s .* x) + b
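The equivalence above is easy to verify numerically: summing the switched per-feature base kernels gives exactly the linear kernel on the masked inputs (the random data here is ours; s^2 = s because s is binary).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 4))            # 5 points, D = 4 features
s = np.array([1.0, 0.0, 1.0, 0.0])     # binary switch vector

# Base kernels k_i(x, x') = x(i) x'(i), one per feature.
K_base = [np.outer(X[:, i], X[:, i]) for i in range(4)]
K_final = sum(s[i] * K_base[i] for i in range(4))

# Equivalent: linear kernel on the masked inputs (s .* x).
K_masked = (X * s) @ (X * s).T
print(np.allclose(K_final, K_masked))  # True
```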