The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition

Nuno Vasconcelos, Purdy Ho, Pedro Moreno
ECE Department, University of California, San Diego
HP Labs Cambridge Research Laboratory
Classification architectures for vision

- modern learning theory favors discriminant over generative architectures for classification
- for vision, a fundamental difference is the set of constraints imposed on the representation:
  - discriminant classifiers favor holistic representations (image as a point in a high-dimensional space)
  - generative classifiers favor localized representations (image as a bag of local features)
- localized representations have various advantages: more invariant, more robust to occlusion, lower dimensionality
Classification architectures for vision

- also, despite weaker guarantees, generative architectures have great practical appeal:
  - better scalability in the number of classes
  - encoding of prior knowledge in the form of statistical models
  - modular solutions, using Bayesian inference
- Q: can all this be combined with discriminant guarantees?
- we consider SVMs and the Kullback-Leibler kernel, and investigate its ability to seamlessly combine discriminant recognition with generative models based on localized representations
Support vector machines
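For reference, the kernel SVM decision rule (a textbook formula, included here for context rather than taken from the slide) is

\[ f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i\, k(x_i, x) + b \Big), \]

where the training points \(x_i\) with non-zero weight \(\alpha_i\) are the support vectors.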
Kernels

[diagram: kernel-based feature transformation]
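The diagram itself is not recoverable, but the transformation it illustrates is the standard kernel trick (included for context): the kernel evaluates an inner product in the transformed feature space,

\[ k(x, y) = \langle \Phi(x), \Phi(y) \rangle, \]

so the SVM can operate in the feature space without ever computing \(\Phi\) explicitly.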
Constraints on the representation

- two possible image representations:
  - holistic: X = the space of all images, each image is a point in X
  - localized: image broken into local windows, X = the space of such windows
- SVM training is O(N^2), N = number of examples
  - holistic: N = I (# images), dim(X) large (e.g., 5,000)
  - localized: N = k*I, k = windows/image, k large (e.g., 5,000), dim(X) small (e.g., 8x8)
- complexity of localized ~ k^2 times that of holistic (e.g., a 2.5 x 10^7 increase)
- no way to capture the grouping of windows into images
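Spelling out the k^2 factor quoted above: with N = I for holistic and N = kI for localized, the O(N^2) training costs differ by

\[ \frac{(kI)^2}{I^2} = k^2 = 5{,}000^2 = 2.5 \times 10^7. \]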
Constraints on the representation

- localized not suited for the traditional SVM
- holistic has been successful, but has limitations:
  - resolution: images too high-dimensional, drastically down-sampled (e.g., from 240x360 to 20x20); the discarded information is important for fine classification (near the boundary)
  - invariance: images as points span quite convoluted manifolds in X when subject to transformations
  - occlusion: a few occluded pixels can lead to a very large jump in X
Localized representations

- resolution is no issue (simply more points per image)
- greater invariance, e.g., greater robustness to occlusion:
  - x% occluded pixels means that only x% of the probability mass changes
  - the remaining (1-x)% should still be enough to obtain a good match
  - whereas, when x% of a vector's components change, matching is hard
Probabilistic kernels

- since the kernel captures similarities between examples, and bags of localized examples are best described by their probability density, it is natural to make the kernel function a measure of distance between probability density functions
- various kernels proposed in the literature: Fisher kernel (Jaakkola et al., 1999), TOP kernel (Tsuda et al., 2002), diffusion kernels (Lafferty and Lebanon, 2002), generalized correlation kernel (Kondor and Jebara, 2003), KL kernel (Moreno et al., 2003)
Probabilistic kernels

- three main advantages over holistic kernels:
  1. enable representations of variable length
  2. enable a compact representation of a large sequence of vectors (through the pdf)
  3. can exploit prior knowledge about the classification problem (selection of suitable probability models)
- the interpretation of the standard Gaussian kernel as
  \[ k(x, y) = \exp\{-a\, d^2(x, y)\}, \]
  where d is the Euclidean distance, suggests a natural extension based on pdf distances
- this leads to the KL kernel
The KL kernel

- relies on the (symmetric) Kullback-Leibler divergence as the measure of pdf distance:
  \[ k(p, q) = \exp\big\{-a\,[KL(p\,\|\,q) + KL(q\,\|\,p)] + b\big\}, \qquad KL(p\,\|\,q) = \int p(x) \log\frac{p(x)}{q(x)}\,dx \]
  (a, b are kernel parameters; see Moreno et al., 2003)
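As a minimal numerical sketch of this definition (assuming the exponentiated symmetric-KL form above; the parameter defaults a and b are illustrative, not from the slides), the following computes the KL kernel between two 1-D Gaussians, for which the KL divergence has a closed form:

```python
import numpy as np

def kl_gauss_1d(m1, s1, m2, s2):
    """Closed-form KL( N(m1,s1^2) || N(m2,s2^2) ) for 1-D Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def kl_kernel(m1, s1, m2, s2, a=1.0, b=0.0):
    """Exponentiated symmetric KL divergence; a, b are illustrative defaults."""
    kl_sym = kl_gauss_1d(m1, s1, m2, s2) + kl_gauss_1d(m2, s2, m1, s1)
    return np.exp(-a * kl_sym + b)

print(kl_kernel(0.0, 1.0, 0.5, 1.2))  # similar densities -> value close to 1
```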
A kernel taxonomy

- probabilistic kernels allow a great deal of flexibility over their traditional counterparts
- the KL kernel can be tuned to the problem in terms of:
  1. performance: choice of probability models that match the statistics of the data
  2. computation: using approximations to the KL that have been shown to work well in certain domains
  3. joint design of features and kernel
- here we focus on 1 and 2; stay tuned for 3
- it is possible to develop a taxonomy of kernels that implement various trade-offs between performance and computation
Parametric densities

- are good models or approximations for various problems
- the kernel can be tailored to the particular pdf family
The Gaussian case

- the Gaussian is a particularly popular case
- in general, it is possible to derive the kernel function in closed form for the parametric cases (see the standard formula below)
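The equations on this slide were lost; the standard closed form it presumably showed, the KL divergence between two d-dimensional Gaussians, is

\[ KL\big(\mathcal{N}(\mu_0,\Sigma_0)\,\big\|\,\mathcal{N}(\mu_1,\Sigma_1)\big) = \tfrac{1}{2}\Big[\operatorname{tr}\big(\Sigma_1^{-1}\Sigma_0\big) + (\mu_1-\mu_0)^{\top}\Sigma_1^{-1}(\mu_1-\mu_0) - d + \log\tfrac{|\Sigma_1|}{|\Sigma_0|}\Big], \]

which can be plugged directly into the exponentiated symmetric-KL kernel.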
Non-parametric densities

- non-parametric density models can be a lot trickier
- some have closed-form KL kernels, e.g., the histogram
- extensions available for histograms defined on different partitions (Vasconcelos, Trans. Inf. Theory, 2004)
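A small sketch of the histogram case (the eps smoothing for empty bins is an implementation choice, not something prescribed by the slides):

```python
import numpy as np

def kl_kernel_hist(h1, h2, a=1.0, eps=1e-10):
    """Symmetric-KL kernel between two histograms; eps avoids log(0)."""
    p = (h1 + eps) / (h1 + eps).sum()
    q = (h2 + eps) / (h2 + eps).sum()
    kl_sym = np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))
    return np.exp(-a * kl_sym)

print(kl_kernel_hist(np.array([4., 3., 2., 1.]), np.array([1., 2., 3., 4.])))
```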
Approximations

- various approximations are possible for kernels without closed form
- in some cases, even the KL divergence itself has no closed form, e.g., between two Gaussian mixtures
Approximations and sampling

- various specific approximations have been recently proposed for the Gauss mixture case:
  - log-sum bound (Singer and Warmuth, NIPS 98)
  - asymptotic likelihood approximation (Vasconcelos, ICCV 2001; Trans. IT, 2004)
  - unscented transformation (Goldberger et al., ICCV 2004)
- finally, one can always use Monte Carlo sampling (sketched below)
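A sketch of the Monte Carlo option, assuming scikit-learn's GaussianMixture as the mixture implementation (not the authors' code): KL(p||q) is estimated by sampling from p and averaging log p(x) - log q(x).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_symmetric_kl(p, q, n=5000):
    """Monte Carlo estimate of KL(p||q) + KL(q||p) for fitted mixtures."""
    Xp, _ = p.sample(n)                                  # draw x ~ p
    Xq, _ = q.sample(n)                                  # draw x ~ q
    kl_pq = np.mean(p.score_samples(Xp) - q.score_samples(Xp))
    kl_qp = np.mean(q.score_samples(Xq) - p.score_samples(Xq))
    return kl_pq + kl_qp

rng = np.random.default_rng(0)
p = GaussianMixture(2, random_state=0).fit(rng.normal(0, 1, (500, 2)))
q = GaussianMixture(2, random_state=0).fit(rng.normal(1, 1, (500, 2)))
print(np.exp(-mc_symmetric_kl(p, q)))  # KL-kernel value with a=1, b=0
```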
Experiments

- all on COIL-100, three resolutions: 32x32, 64x64, 128x128
- 4 different combinations of train/test: I images of each object used as the training set, I in {4, 8, 18, 36}; the remaining images used for test
- the dataset with I = n is referred to below simply as "n"
- holistic representation: each image is one vector
- localized representation: image as a feature bag: extract 8x8 windows, compute the DCT, keep the first 32 features; a mixture of 16 Gaussians is fit to each image
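A sketch of this localized pipeline (non-overlapping windows, a low-frequency-first stand-in for the usual zig-zag DCT scan, and diagonal covariances are all assumptions, not details given on the slide):

```python
import numpy as np
from scipy.fft import dctn
from sklearn.mixture import GaussianMixture

def image_to_gmm(img, win=8, n_feat=32, n_comp=16):
    """Bag of local DCT features from one grayscale image, modeled by a GMM."""
    # low-frequency-first coefficient ordering (stand-in for a zig-zag scan)
    order = np.argsort(np.add.outer(np.arange(win), np.arange(win)), axis=None)
    feats = []
    for i in range(0, img.shape[0] - win + 1, win):      # non-overlapping
        for j in range(0, img.shape[1] - win + 1, win):  # 8x8 windows
            block = dctn(img[i:i+win, j:j+win], norm='ortho')
            feats.append(block.ravel()[order][:n_feat])  # keep 32 coefficients
    return GaussianMixture(n_comp, covariance_type='diag').fit(np.asarray(feats))

gmm = image_to_gmm(np.random.rand(128, 128))             # one 128x128 image
```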
COIL-100

- 100 objects subject to 3D rotation, one view every 5 degrees
Results

- holistic: SVM with three different kernels: linear (L-SVM), polynomial of order 2 (P2-SVM), Gaussian (G-SVM)
- localized: standard maximum-likelihood Gauss mixture classifier (GMM); KL kernel with Gauss mixture models (KL-SVM)
- recognition rates (%) reported in the tables that follow
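A sketch of how such an SVM can be trained on a KL-kernel Gram matrix, using scikit-learn's precomputed-kernel interface on toy histogram "densities" (the data and parameters here are illustrative only, not the paper's setup):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
base0 = np.array([4., 3., 2., 1., 1., 1., 1., 1.])       # class-0 shape
H = np.vstack([base0 + np.abs(rng.normal(0, .3, (10, 8))),
               base0[::-1] + np.abs(rng.normal(0, .3, (10, 8)))])
H /= H.sum(axis=1, keepdims=True)                        # normalized histograms
y = np.repeat([0, 1], 10)

def sym_kl(p, q):
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

K = np.exp(-np.array([[sym_kl(p, q) for q in H] for p in H]))
svm = SVC(kernel='precomputed').fit(K, y)                # KL-kernel SVM
print(svm.predict(K[:3]))                                # rows vs. training set
```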
Results

[tables: recognition rates (%) for the "4" and "8" training sets]
Results

[tables: recognition rates (%) for the "18" and "36" training sets]
Observations

- holistic kernels: G-SVM clearly better; excellent when n is large, but drops quickly for small n, where it is weaker than the GMM!
- overall: localized + discriminant (KL-SVM) is best; differences between KL-SVM and G-SVM as high as 10%
- localized + weak learner (GMM) better than holistic + strong learner (G-SVM)
- conclusions:
  - localized is more invariant and leads to an easier classification problem: the weaker classifier (GMM) has better generalization!
  - resolution (higher dimensionality vs. more image information): losses of about 5% at lower resolution; KL-SVM is much more robust than the GMM
Flexibility

- discriminant attributes for recognition depend on the task (e.g., shape better for digits, texture better for landscapes)
- the KL kernel supports multiple representations
- comparison of representations based on:
  - support: point-wise vs. local appearance vs. global appearance
  - color: grayscale vs. color
- all experiments on the "4" dataset, at 128x128; compared:
  - point-wise: KL kernel + histogram (16 bins/channel); histogram intersection (Laplacian kernel, Chapelle et al., Trans. Neural Nets, 1999)
  - local: KL kernel with GMM (8x8 windows)
  - global: G-SVM
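For reference, the Laplacian (L1) histogram kernel in the spirit of Chapelle et al. is

\[ k(h, h') = \exp\Big\{-a \sum_i \lvert h_i - h'_i \rvert \Big\}; \]

the exact parameter settings used in these experiments are not recoverable from the slide.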
Results

- color is an important cue for recognition on COIL
- the less localized the better: point-wise > local >> global
- localization/invariance trade-off:
  - color is so discriminant that even the invariance loss of 8x8 windows is too much
  - the loss of holistic is so large that it performs quite poorly
- on grayscale (less discriminant), localized does best
- conclusion: different representations perform best on different tasks; the flexibility of the KL kernel is a great asset